To perform regression analysis in high dimensions, lasso and ridge estimation
are common choices. However, it has been shown that these methods are
not robust to outliers. Therefore, alternatives such as penalized M-estimation or
the sparse least trimmed squares (LTS) estimator have been proposed. The
robustness of these regression methods can be measured with the influence
function, which quantifies the effect of infinitesimal perturbations in the data.
Furthermore, it can be used to compute the asymptotic variance and the
mean squared error. In this paper we compute the influence function, the
asymptotic variance and the mean squared error for penalized M-estimators
and the sparse LTS estimator. The asymptotic bias of the estimators
makes the calculations nonstandard. We show that only M-estimators with a
loss function with a bounded derivative are robust against regression outliers.
In particular, the lasso has an unbounded influence function.

1 Introduction

Consider the usual regression situation. We have data $(X, y)$, where
$X \in \mathbb{R}^{n \times p}$ is the predictor matrix and $y \in \mathbb{R}^{n \times 1}$
the response vector. A linear model is commonly fit using least squares regression.
It is well known that the least squares estimator suffers from large variance in the
presence of high multicollinearity among the predictors. To overcome these problems,
ridge [Hoerl and Kennard, 1977] and lasso estimation [Tibshirani, 1996] add a penalty
term to the objective function of least squares regression. The lasso estimator is

\[ \hat\beta_{LASSO} = \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^{n} (y_i - x_i'\beta)^2 + 2\lambda \sum_{j=1}^{p} |\beta_j|. \tag{2} \]

In contrast to the ridge estimator, which only shrinks the coefficients of the least
squares estimate $\hat\beta_{LS}$, the lasso estimator also sets many of the coefficients
to zero. This increases interpretability, especially in high-dimensional models. The
main drawback of the lasso is that it is not robust to outliers. As Alfons et al. [2013]
have shown, the breakdown point of the lasso is $1/n$. This means that a single outlier
can make the estimate completely unreliable.

Hence, robust alternatives have been proposed. The least absolute deviation (LAD)
estimator is well suited for heavy-tailed error distributions, but does not perform any
variable selection. To simultaneously perform robust parameter estimation and variable
selection, Wang et al. [2007] combined LAD regression with lasso regression to obtain
LAD-lasso regression. However, this method has a finite sample breakdown point of $1/n$
[Alfons et al., 2013], and is thus not robust. Therefore, Arslan [2012] provided a
weighted version of the LAD-lasso that is made resistant to outliers by downweighting
leverage points.

A popular robust estimator is the least trimmed squares (LTS) estimator [Rousseeuw
and Leroy, 1987]. Although its simple definition and fast computation make it interesting
for practical application, it cannot be computed for high-dimensional data ($p > n$).
Combining the lasso estimator with the LTS estimator, Alfons et al. [2013] developed
the sparse LTS-estimator

\[ \hat\beta_{spLTS} = \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{h} \sum_{i=1}^{h} r^2_{(i)}(\beta) + \lambda \sum_{j=1}^{p} |\beta_j|, \tag{3} \]

where $r_i^2(\beta) = (y_i - x_i'\beta)^2$ denotes the squared residuals and
$r^2_{(1)}(\beta) \le \dots \le r^2_{(n)}(\beta)$ their order statistics. Here $\lambda > 0$
is a penalty parameter and $h \le n$ the size of the subsample that is considered to
consist of non-outlying observations. This estimator can be applied to high-dimensional
data with good prediction performance and high robustness. It also has a high breakdown
point [Alfons et al., 2013].
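The sparse LTS objective described above can be sketched in a few lines of Python. This is only an illustration of the definition, not the authors' implementation (which relies on the R package robustHD); the function name `sparse_lts_objective` and the toy data are assumptions made for this example.

```python
def sparse_lts_objective(beta, X, y, h, lam):
    """Mean of the h smallest squared residuals plus lam times the L1 norm."""
    squared_residuals = []
    for row, yi in zip(X, y):
        fitted = sum(xij * bj for xij, bj in zip(row, beta))
        squared_residuals.append((yi - fitted) ** 2)
    trimmed = sorted(squared_residuals)[:h]  # keep only the h best-fitting points
    return sum(trimmed) / h + lam * sum(abs(bj) for bj in beta)


# Toy data (hypothetical): three points on the line y = x and one vertical outlier.
X = [[1.0], [2.0], [3.0], [4.0]]
y = [1.0, 2.0, 3.0, 100.0]

value_trimmed = sparse_lts_objective([1.0], X, y, h=3, lam=0.1)    # outlier trimmed away
value_untrimmed = sparse_lts_objective([1.0], X, y, h=4, lam=0.1)  # outlier dominates
```

With `h = 3` the outlier does not enter the loss at all, whereas with `h = n = 4` it dominates the objective, which illustrates why trimming protects the estimator.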


All estimators mentioned so far, except the LTS and the sparse LTS-estimator, are
special cases of a more general estimator, the penalized M-estimator [Li et al., 2011]

\[ \hat\beta_M = \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^{n} \rho(y_i - x_i'\beta) + 2\lambda \sum_{j=1}^{p} J(\beta_j), \tag{4} \]

with loss function $\rho : \mathbb{R} \to \mathbb{R}$ and penalty function
$J : \mathbb{R} \to \mathbb{R}$. While lasso and ridge have a quadratic loss function
$\rho(z) = z^2$, LAD and LAD-lasso use the absolute value loss $\rho(z) = |z|$. The
penalty of ridge is quadratic, $J(z) = z^2$, whereas lasso and LAD-lasso use an
$L_1$-penalty $J(z) = |z|$, and the 'penalty' of least squares and LAD can be seen
as the constant function $J(z) = 0$. In the next sections we will see how the choice of
the loss function affects the robustness of the estimator. In Equation (4), we implicitly
assume that the scale of the error term is fixed and known, in order to keep the
calculations feasible. In practice, this implies that the argument of the $\rho$-function
needs to be scaled by a preliminary scale estimate. Note that this assumption does not
affect the lasso or ridge estimator.

The rest of the paper is organized as follows. In Section 2 we define the penalized
M-estimator at a functional level. In Section 3 we study its bias for different penalties
and loss functions. We also give an explicit solution for sparse LTS for simple regression.
In Section 4 we derive the influence function of the penalized M-estimator. Section 5 is
devoted to the lasso. We give its influence function and describe the lasso as a limit case
of penalized M-estimators with a differentiable penalty function. For sparse LTS we give
the corresponding influence function in Section 6. In Section 7 we compare the plots of
influence functions for varying loss functions and penalties. A comparison at sample level
is provided in Section 8. Using the results of Sections 4 to 6, Section 9 compares sparse
LTS and different penalized M-estimators by looking at asymptotic variance and mean
squared error. Section 10 concludes. The appendix contains all proofs.

2 Functionals 

Throughout the paper we work with the typical regression model

\[ y = x'\beta_0 + e \tag{5} \]

with centered and symmetrically distributed error term $e$. The number of predictor
variables is $p$ and the variance of the error term $e$ is denoted by $\sigma^2$. We
assume independence of the regressor $x$ and the error term $e$ and denote the joint
model distribution of $x$ and $y$ by $H_0$. Whenever we do not make any assumptions on
the joint distribution of $x$ and $y$, we denote it by $H$.

The estimators in Section 1 are all defined at the sample level. To derive their influence
function, we first need to introduce their equivalents at the population level. For the
penalized M-estimator (4), the corresponding definition at the population level, with
$(x, y) \sim H$, is

\[ \beta_M(H) = \arg\min_{\beta \in \mathbb{R}^p} E_H[\rho(y - x'\beta)] + 2\lambda \sum_{j=1}^{p} J(\beta_j). \tag{6} \]

An example of a penalized M-estimator is the ridge functional, for which
$\rho(z) = J(z) = z^2$. Also the lasso functional

\[ \beta_{LASSO}(H) = \arg\min_{\beta \in \mathbb{R}^p} E_H[(y - x'\beta)^2] + 2\lambda \sum_{j=1}^{p} |\beta_j| \tag{7} \]

can be seen as a special case of the penalized M-estimator. However, its penalty is not
differentiable, which will cause problems in the computation of the influence function.

To create more robust functionals, loss functions other than the classical quadratic
loss function $\rho(z) = z^2$ can be considered. Popular choices are the Huber function

\[ \rho_H(z) = \begin{cases} z^2 & \text{if } |z| \le k_H, \\ 2k_H|z| - k_H^2 & \text{if } |z| > k_H, \end{cases} \tag{8} \]

and Tukey's biweight function

\[ \rho_{BI}(z) = \begin{cases} 1 - \left(1 - (z/k_{BI})^2\right)^3 & \text{if } |z| \le k_{BI}, \\ 1 & \text{if } |z| > k_{BI}. \end{cases} \tag{9} \]

The Huber loss function $\rho_H$ is a continuous, differentiable function that is quadratic
in a central region $[-k_H, k_H]$ and increases only linearly outside of this interval
(compare Figure 1). The function value of extreme residuals is therefore lower than with a
quadratic loss function and, as a consequence, those observations have less influence on
the estimate. Due to the quadratic part in the central region, the Huber loss function is
still differentiable at zero, in contrast to an absolute value loss. The main advantage of
the biweight function $\rho_{BI}$ (sometimes also called 'bisquared' function) is that it is
a smooth function that trims large residuals, while small residuals receive a function value
that is similar to that of a quadratic loss (compare Figure 1). The choice of the tuning
constants $k_{BI}$ and $k_H$ determines the breakdown point and efficiency of the
functionals. We use $k_{BI} = 4.685$ and $k_H = 1.345$, which gives 95% efficiency for a
standard normal error distribution in the unpenalized case. To justify the choice of $k$
also for distributions with a scale different from 1, the tuning parameter has to be
adjusted to $k\sigma$.
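The two loss functions (8) and (9) and their derivatives can be written down directly. The snippet below is a minimal sketch assuming unit error scale; the function names are illustrative and not taken from the paper. The derivative $\psi = \rho'$ is included because its boundedness is what matters for robustness later on.

```python
K_HUBER = 1.345      # tuning constants quoted in the text
K_BIWEIGHT = 4.685


def huber_rho(z, k=K_HUBER):
    """Huber loss: quadratic on [-k, k], linear outside."""
    return z * z if abs(z) <= k else 2.0 * k * abs(z) - k * k


def huber_psi(z, k=K_HUBER):
    """Derivative of the Huber loss; bounded by 2k in absolute value."""
    if abs(z) <= k:
        return 2.0 * z
    return 2.0 * k if z > 0 else -2.0 * k


def biweight_rho(z, k=K_BIWEIGHT):
    """Tukey's biweight loss: smooth and constant (equal to 1) beyond k."""
    if abs(z) > k:
        return 1.0
    return 1.0 - (1.0 - (z / k) ** 2) ** 3


def biweight_psi(z, k=K_BIWEIGHT):
    """Derivative of the biweight loss; exactly zero for |z| > k."""
    if abs(z) > k:
        return 0.0
    return 6.0 * z / k ** 2 * (1.0 - (z / k) ** 2) ** 2
```

For a very large residual, `huber_psi` stays at `2 * K_HUBER` while `biweight_psi` returns zero, which matches the description of the biweight loss as one that trims large residuals.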

Apart from the $L_1$- and $L_2$-penalty used in lasso and ridge estimation, respectively,
other penalty functions can also be considered. Another popular choice is the smoothly
clipped absolute deviation (SCAD) penalty [Fan and Li, 2001] (see Figure 2)

\[ J_{SCAD}(\beta) = \begin{cases} |\beta| & \text{if } |\beta| \le \lambda, \\ -\dfrac{(|\beta| - a\lambda)^2}{2(a-1)\lambda} + \lambda\dfrac{a+1}{2} & \text{if } \lambda < |\beta| \le a\lambda, \\ \lambda\dfrac{a+1}{2} & \text{if } |\beta| > a\lambda. \end{cases} \tag{10} \]

While the SCAD functional, exactly like the lasso, shrinks (with respect to $\lambda$) small
parameters to zero, large values are not shrunk at all, exactly as in least squares regression.
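A short sketch of the SCAD penalty (10) makes its three regimes concrete. The function name `scad_penalty` is an assumption of this example, and the default `a = 3.7` is the value the paper uses in its later plots.

```python
def scad_penalty(beta, lam, a=3.7):
    """SCAD penalty in the scaling of Equation (10): it is multiplied by
    2 * lam in the objective, so the first branch is |beta| rather than lam * |beta|."""
    b = abs(beta)
    if b <= lam:
        return b                                   # lasso-like part near zero
    if b <= a * lam:
        return -(b - a * lam) ** 2 / (2.0 * (a - 1.0) * lam) + lam * (a + 1.0) / 2.0
    return lam * (a + 1.0) / 2.0                   # constant: no further shrinkage


lam = 0.1
plateau = scad_penalty(10.0, lam)  # value on the flat part of the penalty
```

Because the penalty is flat beyond $a\lambda$, its derivative vanishes there, which is why large coefficients are not shrunk.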
Figure 2: The smoothly clipped absolute deviation (SCAD) penalty function

The definition of the sparse LTS estimator at a population level is

\[ \beta_{spLTS}(H) = \arg\min_{\beta \in \mathbb{R}^p} E_H\left[(y - x'\beta)^2 I_{[|y - x'\beta| \le q_\beta]}\right] + \alpha\lambda \sum_{j=1}^{p} |\beta_j|, \tag{11} \]

with $q_\beta$ the $\alpha$-quantile of $|y - x'\beta|$. As recommended in Alfons et al. [2013],
we take $\alpha = 0.75$.


3 Bias 


The penalized M-functional $\beta_M$ has a bias

\[ \mathrm{Bias}(\beta_M, H_0) = \beta_M(H_0) - \beta_0 \tag{12} \]

at the model distribution $H_0$. The bias is due to the penalization and is also present
for penalized least squares functionals. Note that there is no bias for non-penalized
M-functionals. The difficulty of Equation (12) lies in the computation of the functional
$\beta_M(H_0)$. For the lasso functional, there exists an explicit solution only for simple
regression (i.e. $p = 1$)

\[ \beta_{LASSO}(H) = \mathrm{sign}(\beta_{LS}(H)) \left( |\beta_{LS}(H)| - \frac{\lambda}{E_H[x^2]} \right)_+ . \tag{13} \]

Here $\beta_{LS}(H) = E_H[xy]/E_H[x^2]$ denotes the least squares functional and
$(z)_+ = \max(0, z)$ the positive part function. For completeness, we give a proof of
Equation (13) in the appendix. For multiple regression the lasso functional at the model
distribution $H_0$ can be computed using the idea of the coordinate descent algorithm (see
Section 5), with the model parameter $\beta_0$ as a starting value. Similarly, for the SCAD
functional there exists an explicit solution only for simple regression


\[ \beta_{SCAD}(H) = \begin{cases} \left(|\beta_{LS}(H)| - \dfrac{\lambda}{E_H[x^2]}\right)_+ \mathrm{sign}(\beta_{LS}(H)) & \text{if } |\beta_{LS}(H)| \le \lambda + \dfrac{\lambda}{E_H[x^2]}, \\ \dfrac{(a-1)E_H[x^2]\,\beta_{LS}(H) - a\lambda\,\mathrm{sign}(\beta_{LS}(H))}{(a-1)E_H[x^2] - 1} & \text{if } \lambda + \dfrac{\lambda}{E_H[x^2]} < |\beta_{LS}(H)| \le a\lambda, \\ \beta_{LS}(H) & \text{if } |\beta_{LS}(H)| > a\lambda. \end{cases} \tag{14} \]

This can be proved using the same ideas as in the computation of the solution for the
lasso functional in simple regression (see Proof of Equation (13) in the appendix). Here
the additional assumption $E_H[x^2] > 1/(a-1)$ is needed. As can be seen from Equation
(14), the SCAD functional is unbiased at the model $H_0$ for large values of the parameter
$\beta_0$.
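The closed forms (13) and (14) can be checked against a brute-force minimization of the population objective. In simple regression the objective equals, up to a constant, $E_H[x^2]\beta^2 - 2E_H[x^2]\beta_{LS}\beta + 2\lambda J(\beta)$. The sketch below assumes $E_H[x^2] = 1$, $\lambda = 0.1$ and $a = 3.7$; all function names and the grid search are illustrative choices, not part of the paper.

```python
def sign(z):
    return (z > 0) - (z < 0)


def lasso_functional(beta_ls, v, lam):
    """Equation (13): soft-thresholded least squares functional (v = E[x^2])."""
    return sign(beta_ls) * max(abs(beta_ls) - lam / v, 0.0)


def scad_functional(beta_ls, v, lam, a=3.7):
    """Equation (14); requires v > 1 / (a - 1)."""
    b = abs(beta_ls)
    if b <= lam + lam / v:
        return sign(beta_ls) * max(b - lam / v, 0.0)
    if b <= a * lam:
        return ((a - 1.0) * v * beta_ls - a * lam * sign(beta_ls)) / ((a - 1.0) * v - 1.0)
    return beta_ls


def scad_penalty(beta, lam, a=3.7):
    b = abs(beta)
    if b <= lam:
        return b
    if b <= a * lam:
        return -(b - a * lam) ** 2 / (2.0 * (a - 1.0) * lam) + lam * (a + 1.0) / 2.0
    return lam * (a + 1.0) / 2.0


def grid_argmin(penalty, beta_ls, v, lam, half_width=2.0, steps=40000):
    """Brute-force minimizer of v*b^2 - 2*v*beta_ls*b + 2*lam*penalty(b)."""
    best_beta, best_value = 0.0, float("inf")
    for i in range(steps + 1):
        beta = -half_width + 2.0 * half_width * i / steps
        value = v * beta * beta - 2.0 * v * beta_ls * beta + 2.0 * lam * penalty(beta)
        if value < best_value:
            best_beta, best_value = beta, value
    return best_beta


v, lam = 1.0, 0.1
max_gap = 0.0
for beta_ls in (0.15, 0.3, 1.0, -0.3):
    gap_lasso = abs(lasso_functional(beta_ls, v, lam) - grid_argmin(abs, beta_ls, v, lam))
    gap_scad = abs(scad_functional(beta_ls, v, lam)
                   - grid_argmin(lambda b: scad_penalty(b, lam), beta_ls, v, lam))
    max_gap = max(max_gap, gap_lasso, gap_scad)
```

The values of `beta_ls` are chosen so that each of the three branches of (14) is exercised at least once.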

To compute the value of a penalized M-functional that does not use a quadratic loss
function, the iteratively reweighted least squares (IRLS) algorithm [Osborne, 1985] can
be used to find a solution. Equation (6) can be rewritten as

\[ \beta_M(H) = \arg\min_{\beta \in \mathbb{R}^p} E_H[w(\beta)(y - x'\beta)^2] + 2\lambda \sum_{j=1}^{p} J(\beta_j) \]

with weights $w(\beta) = \rho(y - x'\beta)/(y - x'\beta)^2$. If a value of $\beta$ is available,
the weights can be computed. If the weights are taken as fixed, $\beta_M$ can be computed
using a weighted lasso (if an $L_1$-penalty is used), a weighted SCAD (for a SCAD-penalty)
or a weighted ridge (if an $L_2$-penalty is used). Weighted lasso and weighted SCAD can be
computed using a coordinate descent algorithm; for the weighted ridge an explicit solution
exists. Computing weights and $\beta_M$ iteratively, convergence to a local solution of the
objective function will be reached. As a good starting value we take the true value
$\beta_0$. The expected values that are needed for the weighted lasso/SCAD/ridge are
calculated by Monte Carlo approximation.
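The reweighting scheme above can be sketched at the sample level for simple regression with Huber loss and an $L_1$-penalty. This is a hedged illustration rather than the authors' code: it replaces expectations by sample means, assumes unit error scale, uses the weights $\rho(r)/r^2$ exactly as stated in the text, and the names `weighted_lasso_1d` and `irls_huber_lasso` as well as the toy data are hypothetical.

```python
def huber_rho(z, k=1.345):
    return z * z if abs(z) <= k else 2.0 * k * abs(z) - k * k


def weighted_lasso_1d(x, y, w, lam):
    """Closed-form minimizer of mean(w * (y - x*b)^2) + 2*lam*|b| for p = 1."""
    n = len(x)
    s = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y)) / n
    v = sum(wi * xi * xi for wi, xi in zip(w, x)) / n
    shrunk = max(abs(s) - lam, 0.0)
    return shrunk / v if s >= 0 else -shrunk / v


def irls_huber_lasso(x, y, lam, beta_start, n_iter=50):
    beta = beta_start
    for _ in range(n_iter):
        residuals = [yi - xi * beta for xi, yi in zip(x, y)]
        # Weights rho(r) / r^2; a (near-)zero residual gets weight 1,
        # the limit of the ratio in the quadratic region.
        w = [huber_rho(r) / (r * r) if abs(r) > 1e-12 else 1.0 for r in residuals]
        beta = weighted_lasso_1d(x, y, w, lam)
    return beta


# Toy data (hypothetical): points on y = 1.5 x, then one vertical outlier added.
x_clean = [-2.0, -1.0, 0.0, 1.0, 2.0]
y_clean = [1.5 * xi for xi in x_clean]
x_out, y_out = x_clean + [1.0], y_clean + [50.0]

beta_clean = irls_huber_lasso(x_clean, y_clean, lam=0.1, beta_start=1.5)
beta_robust = irls_huber_lasso(x_out, y_out, lam=0.1, beta_start=1.5)
beta_plain = weighted_lasso_1d(x_out, y_out, [1.0] * len(x_out), lam=0.1)  # ordinary lasso
```

On the clean data all residuals stay in the quadratic region, so the weights equal one and the result coincides with the ordinary lasso. With the outlier, the downweighted fit stays much closer to the slope 1.5 than the ordinary lasso does.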

For the sparse LTS functional, we can find an explicit solution for simple regression
with normal predictor and error term.

Lemma 3.1. Let $y = x\beta_0 + e$ be a simple regression model as in (5). Let $H_0$ be the
joint distribution of $x$ and $y$, with $x$ and $e$ normally distributed. Then the explicit
solution of the sparse LTS functional (11) is

\[ \beta_{spLTS}(H_0) = \mathrm{sign}(\beta_0) \left( |\beta_0| - \frac{\alpha\lambda}{2c_1E_{H_0}[x^2]} \right)_+ \tag{15} \]

with $c_1 = \alpha - 2q_\alpha\phi(q_\alpha)$, $q_\alpha$ the $\frac{\alpha+1}{2}$-quantile of the
standard normal distribution and $\phi$ its density.

Lemma 3.1 gives an explicit solution of the sparse LTS functional only for normally
distributed errors and predictors, which is a strong limitation. In the general case, with
$x \sim F$, $e \sim G$, and $x$ and $e$ independent, the residual
$y - x\beta = x(\beta_0 - \beta) + e$ follows a distribution
$D_\beta(z) = F(z/(\beta_0 - \beta)) * G(z)$ for $\beta_0 > \beta$, where $*$ denotes the
convolution. Without an explicit expression for $D_\beta$, it will be hard to obtain an
explicit solution for the sparse LTS functional. On the other hand, if $D_\beta$ is
explicitly known, the proof of Lemma 3.1 can be followed and an explicit solution for the
sparse LTS-functional can be found. A case where explicit results are feasible is for $x$
and $e$ both Cauchy distributed, since the convolution of Cauchy distributed variables
remains Cauchy. Results for this case are available from the first author upon request.
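The constant $c_1$ in Lemma 3.1 is the trimmed second moment $E[Z^2 I_{[|Z| \le q_\alpha]}]$ of a standard normal variable $Z$, which can be verified numerically. The sketch below uses only the Python standard library (`statistics.NormalDist`); the helper names, the trapezoidal rule and the default $E_{H_0}[x^2] = 1$ are assumptions of this example.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def c1_constant(alpha):
    """Return (c1, q_alpha) with q_alpha the (alpha + 1) / 2 quantile of N(0, 1)."""
    q = STD_NORMAL.inv_cdf((alpha + 1.0) / 2.0)
    return alpha - 2.0 * q * STD_NORMAL.pdf(q), q


def trimmed_second_moment(q, steps=20000):
    """Trapezoidal approximation of E[Z^2 * 1(|Z| <= q)] for Z ~ N(0, 1)."""
    width = 2.0 * q / steps
    total = 0.0
    for i in range(steps + 1):
        z = -q + i * width
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * z * z * STD_NORMAL.pdf(z)
    return total * width


def sparse_lts_functional(beta0, lam, alpha=0.75, second_moment_x=1.0):
    """Equation (15) for normal predictor and error."""
    c1, _ = c1_constant(alpha)
    shrunk = max(abs(beta0) - alpha * lam / (2.0 * c1 * second_moment_x), 0.0)
    return shrunk if beta0 >= 0 else -shrunk


c1, q_alpha = c1_constant(0.75)
numeric_c1 = trimmed_second_moment(q_alpha)
beta_large = sparse_lts_functional(1.5, lam=0.1)   # shrunk, but not to zero
beta_small = sparse_lts_functional(0.1, lam=0.1)   # below the threshold
```

For $\alpha = 0.75$ the threshold $\alpha\lambda/(2c_1)$ is larger than $\lambda$, so with the same $\lambda$ sparse LTS shrinks more than the lasso; this is in line with the remark in the text that $\lambda$ plays a different role for different estimators.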

To study the bias of the various functionals of Section 2, we take $p = 1$ and assume
$x$ and $e$ to be standard normally distributed. We use $\lambda = 0.1$ for all functionals.
Figure 3 displays the bias as a function of $\beta_0$. Of all functionals used, only least
squares has zero bias. The $L_1$-penalized functionals have a constant bias for values of
$\beta_0$ that are not shrunken to zero. For smaller values of $\beta_0$ the bias increases
monotonically in absolute value. Please note that the penalty parameter $\lambda$ plays a
different role for different estimators, as the same $\lambda$ yields different amounts of
shrinkage for different estimators. For this reason, Figure 3 illustrates only the general
shape of the bias as a function of $\beta_0$.

4 The Influence Function 


The robustness of a functional $\beta$ can be measured via the influence function

\[ IF((x_0, y_0), \beta, H) = \frac{\partial}{\partial\epsilon} \Big[ \beta\big((1-\epsilon)H + \epsilon\,\delta_{(x_0, y_0)}\big) \Big] \Big|_{\epsilon=0}. \]

It describes the effect of infinitesimal, pointwise contamination in $(x_0, y_0)$ on the
functional $\beta$. Here $H$ denotes any distribution and $\delta_z$ the point mass
distribution at $z$. To compute the influence function of the penalized M-functional (6),
smoothness conditions for the functions $\rho(\cdot)$ and $J(\cdot)$ have to be assumed.


Proposition 4.1. Let $y = x'\beta_0 + e$ be a regression model as defined in (5). Furthermore,
let $\rho, J : \mathbb{R} \to \mathbb{R}$ be twice differentiable functions and denote the
derivative of $\rho$ by $\psi := \rho'$. Then the influence function of the penalized
M-functional $\beta_M$ for $\lambda > 0$ is given by

\[ IF((x_0, y_0), \beta_M, H_0) = \Big( E_{H_0}[\psi'(y - x'\beta_M(H_0))\,xx'] + 2\lambda\,\mathrm{diag}\big(J''(\beta_M(H_0))\big) \Big)^{-1} \cdot \Big( \psi(y_0 - x_0'\beta_M(H_0))\,x_0 - E_{H_0}[\psi(y - x'\beta_M(H_0))\,x] \Big). \tag{16} \]

The influence function (16) of the penalized M-functional is unbounded in $x_0$ and is only
bounded in $y_0$ if $\psi(\cdot)$ is bounded. In Section 7 we will see that the effect of the
penalty on the shape of the influence function is quite small compared to the effect of the
loss function.

Figure 3: Bias of various functionals for different values of $\beta_0$ ($\lambda = 0.1$ fixed).
Note that the small fluctuations are due to Monte Carlo simulations in the computation
of the functional.

As the ridge functional can be seen as a special case of the penalized M-functional (6),
its influence function follows as a corollary:

Corollary 4.2. The influence function of the ridge functional $\beta_{RIDGE}$ is

\[ IF((x_0, y_0), \beta_{RIDGE}, H_0) = \big( E_{H_0}[xx'] + 2\lambda I_p \big)^{-1} \Big( (y_0 - x_0'\beta_{RIDGE}(H_0))\,x_0 + E_{H_0}[xx']\,\mathrm{Bias}(\beta_{RIDGE}, H_0) \Big). \tag{17} \]

As the function $\psi(z) = 2z$ is unbounded, the influence function (17) of the ridge
functional is unbounded. Thus the ridge functional is not robust to any kind of outliers.
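Corollary 4.2 can be checked numerically in simple regression, where the ridge functional has the closed form $E_H[xy]/(E_H[x^2] + 2\lambda)$. The sketch below compares formula (17) with a finite-difference derivative of the functional along the contaminated distribution $(1-\epsilon)H_0 + \epsilon\,\delta_{(x_0, y_0)}$. It assumes $p = 1$ and $E_{H_0}[x^2] = 1$; the function names and the contamination point are illustrative.

```python
def ridge_functional(exy, exx, lam):
    """Population ridge functional for p = 1: argmin E[(y - x*b)^2] + 2*lam*b^2."""
    return exy / (exx + 2.0 * lam)


def ridge_if_formula(x0, y0, beta0, lam, exx=1.0):
    """Equation (17) specialised to p = 1."""
    beta_ridge = ridge_functional(beta0 * exx, exx, lam)
    bias = beta_ridge - beta0
    return ((y0 - x0 * beta_ridge) * x0 + exx * bias) / (exx + 2.0 * lam)


def ridge_if_numeric(x0, y0, beta0, lam, exx=1.0, eps=1e-6):
    """Finite-difference derivative at the contaminated distribution."""
    exy_eps = (1.0 - eps) * beta0 * exx + eps * x0 * y0   # E[xy] under contamination
    exx_eps = (1.0 - eps) * exx + eps * x0 * x0           # E[x^2] under contamination
    base = ridge_functional(beta0 * exx, exx, lam)
    return (ridge_functional(exy_eps, exx_eps, lam) - base) / eps


if_formula = ridge_if_formula(2.0, -3.0, beta0=1.5, lam=0.1)
if_numeric = ridge_if_numeric(2.0, -3.0, beta0=1.5, lam=0.1)
```

Both values agree up to the discretization error of the finite difference; increasing `x0` or `y0` makes the value grow without bound, as stated above.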

The penalty function $J(z) := |z|$ of the lasso functional and the sparse LTS functional
is not twice differentiable at zero. Therefore those functionals are not special cases of
the M-functional used in Proposition 4.1 and have to be considered separately to derive the
influence function.


5 The Influence Function of the Lasso 


For simple regression, i.e. for $p = 1$, an explicit solution for the lasso functional exists,
see Equation (13). With that the influence function can be computed easily.

Lemma 5.1. Let $y = x\beta_0 + e$ be a simple regression model as in (5). Then the influence
function of the lasso functional is

\[ IF((x_0, y_0), \beta_{LASSO}, H_0) = \begin{cases} 0 & \text{if } -\dfrac{\lambda}{E_{H_0}[x^2]} < \beta_0 \le \dfrac{\lambda}{E_{H_0}[x^2]}, \\ \dfrac{x_0(y_0 - \beta_0x_0)}{E_{H_0}[x^2]} - \lambda\,\dfrac{E_{H_0}[x^2] - x_0^2}{(E_{H_0}[x^2])^2}\,\mathrm{sign}(\beta_0) & \text{otherwise.} \end{cases} \tag{18} \]

Similar to the influence function of the ridge functional (17), the influence function of
the lasso functional (18) is unbounded in both variables $x_0$ and $y_0$ in case the
coefficient $\beta_{LASSO}$ is not shrunk to zero (Case 2 in Equation (18)). Otherwise the
influence function is constantly zero. The reason for the similarity of the influence
functions of the lasso and the ridge functional is that both are shrunken versions of the
least squares functional.
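Lemma 5.1 can be checked in the same way, by differentiating the explicit solution (13) numerically along the contaminated distribution. The snippet is a sketch under the assumptions $p = 1$ and $E_{H_0}[x^2] = 1$; the function names and the contamination point $(2, -3)$ are illustrative.

```python
def lasso_functional(exy, exx, lam):
    """Equation (13), written in terms of E[xy] and E[x^2]."""
    beta_ls = exy / exx
    shrunk = max(abs(beta_ls) - lam / exx, 0.0)
    return shrunk if beta_ls >= 0 else -shrunk


def lasso_if_formula(x0, y0, beta0, lam, exx=1.0):
    """Equation (18)."""
    if -lam / exx < beta0 <= lam / exx:
        return 0.0
    s = 1.0 if beta0 > 0 else -1.0
    return x0 * (y0 - beta0 * x0) / exx - lam * (exx - x0 * x0) / exx ** 2 * s


def lasso_if_numeric(x0, y0, beta0, lam, exx=1.0, eps=1e-6):
    """Finite-difference derivative at (1 - eps) * H0 + eps * pointmass(x0, y0)."""
    exy_eps = (1.0 - eps) * beta0 * exx + eps * x0 * y0
    exx_eps = (1.0 - eps) * exx + eps * x0 * x0
    base = lasso_functional(beta0 * exx, exx, lam)
    return (lasso_functional(exy_eps, exx_eps, lam) - base) / eps


if_nonzero = lasso_if_formula(2.0, -3.0, beta0=1.5, lam=0.1)
if_nonzero_numeric = lasso_if_numeric(2.0, -3.0, beta0=1.5, lam=0.1)
if_zero = lasso_if_formula(2.0, -3.0, beta0=0.05, lam=0.1)
if_zero_numeric = lasso_if_numeric(2.0, -3.0, beta0=0.05, lam=0.1)
```

For $\beta_0 = 0.05 < \lambda$ the functional stays exactly zero under small contamination, so both the formula and the finite difference return zero.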

As there is no explicit solution in multiple regression for the lasso functional, its
influence function cannot be computed easily. However, Friedman et al. [2007] and Fu [1998]
found an algorithm, the coordinate descent algorithm (also called the shooting algorithm),
to split up the multiple regression into a number of simple regressions. The idea of the
coordinate descent algorithm at population level is to compute the lasso functional (7)
variable by variable. Repeatedly, one variable $j \in \{1, \dots, p\}$ is selected. The value
of the functional $\beta_j^{cd}$ is then computed holding all other coefficients $k \neq j$
fixed at their previous value $\beta_k^*$

\[ \begin{aligned} \beta_j^{cd}(H) &= \arg\min_{\beta_j \in \mathbb{R}} E_H\Big[\Big(y - \sum_{k \neq j} x_k\beta_k^* - x_j\beta_j\Big)^2\Big] + 2\lambda \sum_{k \neq j} |\beta_k^*| + 2\lambda|\beta_j| \\ &= \arg\min_{\beta_j \in \mathbb{R}} E_H\Big[\Big(\Big(y - \sum_{k \neq j} x_k\beta_k^*\Big) - x_j\beta_j\Big)^2\Big] + 2\lambda|\beta_j|. \end{aligned} \tag{19} \]

This can be seen as simple lasso regression with the partial residuals
$y - \sum_{k \neq j} x_k\beta_k^*$ as response and the $j$th coordinate $x_j$ as covariate.
Thus, the new value of $\beta_j^{cd}(H)$ can be easily computed using Equation (13). Looping
through all variables repeatedly, convergence to the lasso functional (7) will be reached
for any starting value [Friedman et al., 2007; Tseng, 2001].
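The coordinate-wise update can be sketched at the sample level, with expectations replaced by sample means and the $2\lambda$ penalty scaling of the lasso objective kept. This is a plain-Python illustration, not the authors' code; the function names and the small orthogonal design are assumptions chosen so that the result can be verified by hand.

```python
def soft_threshold(z, t):
    shrunk = max(abs(z) - t, 0.0)
    return shrunk if z >= 0 else -shrunk


def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Minimise (1/n) * sum (y_i - x_i'b)^2 + 2*lam * sum |b_j| coordinate by coordinate."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            # Partial residuals: remove the fit of all coordinates except j.
            partial = [
                y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j)
                for i in range(n)
            ]
            cov_j = sum(X[i][j] * partial[i] for i in range(n)) / n
            var_j = sum(X[i][j] ** 2 for i in range(n)) / n
            # Simple-regression lasso solution applied to the partial residuals.
            beta[j] = soft_threshold(cov_j, lam) / var_j
    return beta


# Hypothetical orthogonal design with true coefficients (2, 0.05) and no noise.
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [2.0 * row[0] + 0.05 * row[1] for row in X]
beta_hat = lasso_coordinate_descent(X, y, lam=0.1)
```

With orthogonal columns each coordinate is simply soft-thresholded: the first coefficient is shrunk from 2 to 1.9 and the second, being smaller than $\lambda$, is set to zero.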


For the coordinate descent algorithm an influence function can be computed similarly
as for simple regression. However, now the influence function depends on the influence
function of the previous value $\beta^*$.

Lemma 5.2. Let $y = x'\beta_0 + e$ be the regression model of (5). Then the influence function
of the $j$th coordinate of the lasso functional (19) computed via coordinate descent is

\[ IF((x_0, y_0), \beta_j^{cd}, H_0) = \begin{cases} 0 & \text{if } |E_{H_0}[x_jy^{(j)}]| < \lambda, \\[4pt] \dfrac{-E_{H_0}[x_jx^{(j)\prime}]\,IF((x_0, y_0), \beta^{*(j)}, H_0) + (y_0 - x_0^{(j)\prime}\beta^{*(j)}(H_0))(x_0)_j}{E_{H_0}[x_j^2]} - \dfrac{E_{H_0}[x_jy^{(j)}]\,(x_0)_j^2}{(E_{H_0}[x_j^2])^2} \\[4pt] \quad - \lambda\,\dfrac{E_{H_0}[x_j^2] - (x_0)_j^2}{(E_{H_0}[x_j^2])^2}\,\mathrm{sign}(E_{H_0}[x_jy^{(j)}]) & \text{otherwise,} \end{cases} \tag{20} \]

where for any vector $z$ we define $z^{(j)} = (z_1, \dots, z_{j-1}, z_{j+1}, \dots, z_p)'$ and
$y^{(j)} := y - x^{(j)\prime}\beta^{*(j)}(H_0)$, with the functional $\beta^*$ representing the
value of the coordinate descent algorithm at population level in the previous step.


To obtain a formula for the influence function of the lasso functional in multiple
regression, we can use the result of Lemma 5.2. The following proposition holds.

Proposition 5.3. Let $y = x'\beta_0 + e$ be the regression model of (5). Without loss of
generality let $\beta_{LASSO}(H_0) = ((\beta_{LASSO}(H_0))_1, \dots, (\beta_{LASSO}(H_0))_k, 0, \dots, 0)'$
with $k \le p$ and $(\beta_{LASSO}(H_0))_j \neq 0$ for all $j = 1, \dots, k$. Then the influence
function of the lasso functional (7) is

\[ IF((x_0, y_0), \beta_{LASSO}, H_0) = \begin{pmatrix} \big(E_{H_0}[x_{1:k}x_{1:k}']\big)^{-1} \Big( (x_0)_{1:k}\,(y_0 - x_0'\beta_{LASSO}(H_0)) - E_{H_0}[x_{1:k}(y - x'\beta_{LASSO}(H_0))] \Big) \\ 0_{p-k} \end{pmatrix} \tag{21} \]

with the notation $z_{r:s} = (z_r, z_{r+1}, \dots, z_{s-1}, z_s)'$ for $z \in \mathbb{R}^p$,
$r, s \in \{1, \dots, p\}$ and $r \le s$.

Figure 4: Approximation of $|\beta|$ using $\beta \cdot \tanh(K\beta)$

Thus, the influence function of the lasso estimator is zero for variables $j$ with
coefficients $(\beta_{LASSO}(H_0))_j$ shrunk to zero. This implies that for an infinitesimal
amount of contamination, the lasso estimator in those variables $j$ stays
$(\beta_{LASSO}(H_0))_j = 0$ and is not affected by the contamination.

Another approach to compute the influence function of the lasso functional is to consider
it as a limit case of functionals satisfying the conditions of Proposition 4.1. The
following sequence of hyperbolic tangent functions converges to the sign-function

\[ \lim_{K \to \infty} \tanh(Kx) = \begin{cases} +1 & \text{if } x > 0, \\ -1 & \text{if } x < 0, \\ 0 & \text{if } x = 0. \end{cases} \]

Hence, it can be used to get a smooth approximation of the absolute value function

\[ |x| = x \cdot \mathrm{sign}(x) = \lim_{K \to \infty} x \cdot \tanh(Kx). \tag{22} \]

The larger the value of $K > 1$, the better the approximation becomes (see Figure 4).
Therefore the penalty function $J_K(\beta_j) = \beta_j \tanh(K\beta_j)$ is an approximation of
$J_{LASSO}(\beta_j) = |\beta_j|$. As $J_K$ is a smooth function, the influence function of the
corresponding functional

\[ \beta_K(H_0) = \arg\min_{\beta \in \mathbb{R}^p} E_{H_0}[(y - x'\beta)^2] + 2\lambda \sum_{j=1}^{p} J_K(\beta_j) \tag{23} \]

can be computed by applying Proposition 4.1. Taking the limit of this influence function,
we obtain the influence function of the lasso functional. It coincides with the expression
given in Proposition 5.3.

Lemma 5.4. Let $y = x'\beta_0 + e$ be the regression model of (5). Without loss of generality
let $\beta_{LASSO}(H_0) = ((\beta_{LASSO}(H_0))_1, \dots, (\beta_{LASSO}(H_0))_k, 0, \dots, 0)'$ with
$k \le p$ and $(\beta_{LASSO}(H_0))_j \neq 0$ for all $j = 1, \dots, k$. Then the influence
function of the penalized M-estimator (23) converges to the influence function of the lasso
functional given in (21) as $K$ tends to infinity.
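The quality of the approximation (22) is easy to inspect numerically. The following sketch (function names and grid are illustrative assumptions) computes the largest gap between $|x|$ and $x \tanh(Kx)$ on $[-2, 2]$ for increasing $K$.

```python
import math


def smooth_abs(x, K):
    """Smooth surrogate x * tanh(K * x) for the absolute value |x|."""
    return x * math.tanh(K * x)


def max_abs_error(K, half_width=2.0, steps=4000):
    """Largest value of |x| - x * tanh(K * x) on an equally spaced grid."""
    worst = 0.0
    for i in range(steps + 1):
        x = -half_width + 2.0 * half_width * i / steps
        worst = max(worst, abs(x) - smooth_abs(x, K))
    return worst


errors = {K: max_abs_error(K) for K in (1, 10, 100)}
```

The maximal gap shrinks roughly in proportion to $1/K$, since $|x| - x\tanh(Kx) = t(1 - \tanh t)/K$ with $t = K|x|$ and the factor $t(1 - \tanh t)$ is bounded.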


6 The Influence Function of sparse LTS 

For sparse LTS, computation of the influence function is more difficult than for the lasso.
In addition to the nondifferentiable penalty function, sparse LTS also has a discontinuous
loss function. For simplicity, we therefore assume a univariate normal distribution for
the predictor $x$ and the error $e$. However, the ideas presented below can also be used to
derive the influence function for other distributions (similar to what is stated below
Lemma 3.1). Results for Cauchy distributed predictors and errors are available from the
first author upon request.

Lemma 6.1. Let $y = x\beta_0 + e$ be a simple regression model as in (5). If $x$ and $e$ are
normally distributed, the influence function of the sparse LTS functional (15) is


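\[ IF((x_0, y_0), \beta_{spLTS}, H_0) = \begin{cases} 0 & \text{if } -\dfrac{\alpha\lambda}{2c_1E_{H_0}[x^2]} < \beta_0 \le \dfrac{\alpha\lambda}{2c_1E_{H_0}[x^2]}, \\[6pt] (\beta_{spLTS}(H_0) - \beta_0) - \dfrac{q_\alpha^2\,(I_{[|r_0| \le q_\alpha]} - \alpha)(\beta_0 - \beta_{spLTS}(H_0))}{\alpha - 2q_\alpha\phi(q_\alpha)} + \dfrac{x_0(y_0 - x_0\beta_{spLTS}(H_0))\,I_{[|r_0| \le q_\alpha]}}{(\alpha - 2q_\alpha\phi(q_\alpha))\,E_{H_0}[x^2]} & \text{otherwise,} \end{cases} \tag{24} \]

with $r_0 = \dfrac{y_0 - x_0\beta_{spLTS}(H_0)}{\sqrt{\sigma^2 + (\beta_0 - \beta_{spLTS}(H_0))^2 E_{H_0}[x^2]}}$
and the same notation as in Lemma 3.1.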


Lemma 6.1 shows that the influence function of the sparse LTS functional may become
unbounded for points $(x_0, y_0)$ that follow the model, i.e. for good leverage points, but
remains bounded elsewhere, in particular for bad leverage points and vertical outliers.
This shows the good robustness properties of sparse LTS.

We can also see from Equation (24) that the influence function of the sparse LTS
functional is zero if the functional is shrunken to zero, i.e. if
$|\beta_0| \le \frac{\alpha\lambda}{2c_1E_{H_0}[x^2]}$. This result is the same as for the lasso
functional (see Proposition 5.3). It implies that infinitesimal amounts of contamination do
not affect the functional when the latter is shrunken to zero.

7 Plots of Influence Functions

We first compare the effects of different penalties and take a quadratic loss function. We
consider least squares, ridge and lasso regression as well as the SCAD penalty (10). To
compute ridge and lasso regression a value for the penalty parameter $\lambda$ is needed, and
for SCAD an additional parameter $a$ has to be specified. We choose a fixed value
$\lambda = 0.1$ and, as proposed by Fan and Li [2001], we use $a = 3.7$.

Influence functions can only be plotted for simple regression $y = x\beta_0 + e$, i.e. for
$p = 1$. We specify the predictor and the error as independent and standard normally
distributed. For the parameter $\beta_0$ we use a value $\beta_0 = 1.5$ that will not be shrunk
to zero by any of the functionals, as well as $\beta_0 = 0$ to focus also on the sparseness of
the functionals. Figures 5 and 6 show the plots of the influence functions for least squares,
ridge, lasso and SCAD for both values of $\beta_0$. Examining Figure 5, one could believe that
all influence functions are equal. The same applies to the influence functions of least
squares and ridge in Figure 6. However, this is not the case. All influence functions
differ from one another because their bias and the second derivative of the penalty
appear in the expression of the influence function. Those terms are different for the
different functionals. Usually, the differences are minor. Note, however, that for some
specific choices of $\lambda$ and $\beta_0$ differences can be substantial. For $\beta_0 = 0$, see
Figure 6, SCAD and lasso produce a constantly zero influence function. We may conclude that
in most cases the effect of the penalty function on the shape of the influence function is
minor.


To compare different loss functions, we use Huber loss (8), biweight loss (9) and sparse
LTS (11), each time combined with the $L_1$-penalty $J(\beta) = |\beta|$ to achieve sparseness.
For the simple regression model $y = x\beta_0 + e$, we specify the predictor and the error
as independent and standard normally distributed and consider $\beta_0 = 0$ and $\beta_0 = 1.5$.
Furthermore, we fix $\lambda = 0.04$.

Figure 7 shows the influence functions of these functionals with Huber and biweight loss
function. They clearly differ from the ones using the classic quadratic loss for coefficients
$\beta_0$ that are not shrunk to zero (compare to the panels corresponding to the lasso in
Figures 5 and 6). The major difference is that the influence functions of functionals with a
bounded loss function (sparse LTS, biweight) are only unbounded for good leverage points and
bounded for regression outliers. This indicates the robust behavior of the functionals. It
is further emphasized by the fact that those observations $(x_0, y_0)$ with big influence
are the ones with small residuals $y_0 - x_0\beta_0$, that is, the ones that closely follow the
underlying model distribution. Observations with large residuals have small and constant
influence. In contrast, the unbounded Huber loss function does not achieve robustness
against all types of outliers. Only for outliers in the response is the influence constant
(for a fixed value of $x_0$). However, if the predictor values increase, the influence of the
corresponding observation increases linearly. For a quadratic loss function the increase
would be quadratic. Thus, a Huber loss reduces the influence of bad leverage points, but
does not bound it. For $\beta(H_0) = 0$ and for all loss functions, the $L_1$-penalized
functionals produce a constantly zero influence function, thus creating sparseness also under
small perturbations from the model. To sum up, a Huber loss function performs better than a
quadratic loss, but neither can bound the influence of bad leverage points. Only sparse
LTS and the penalized M-functional with biweight loss are very robust. They are able to
bound the impact of observations that lie far away from the model, while observations
that closely follow the model get a very high influence.

Figure 5: Influence functions for different penalty functions (least squares, ridge, lasso
and SCAD) for $\beta_0 = 1.5$ with $(x_0, y_0) \in [-10, 10]^2$ and the vertical axis ranging
from $-250$ to $100$

Figure 6: Influence functions for different penalty functions (least squares, ridge, lasso
and SCAD) for $\beta_0 = 0$ with $(x_0, y_0) \in [-10, 10]^2$ and the vertical axis ranging
from $-250$ to $100$


We simulate the expected values that appear in the influence function (16) by Monte
Carlo simulation (using $10^5$ replications). Furthermore, Proposition 4.1 can actually not
be applied as the lasso penalty is not differentiable. However, using either the tanh
approximation (22) or the same approach as in the proof of Proposition 5.3, one can show
that the influence function of these functionals equals zero in case the functional equals
zero and (16) otherwise.


8 Sensitivity Curves 


To study the robustness of the different penalized M-estimators from Section 7 at sample
level, we compute sensitivity curves [Maronna et al., 2006], an empirical version of the
influence function. For an estimator $\hat\beta$ and at sample $(X, y)$, it is defined as

\[ SC(x_0, y_0, \hat\beta) = \frac{\hat\beta(X \cup \{x_0\}, y \cup \{y_0\}) - \hat\beta(X, y)}{\frac{1}{n+1}}. \]

To compute the penalized estimators, we use the coordinate descent algorithm. As
a starting value, we use the least squares estimate for estimators using a quadratic
loss, and the robust sparse LTS-estimate for the others. Sparse LTS can be computed easily
and quickly using the sparseLTS function of the R package robustHD. Furthermore,
we divide the argument of the $\rho$-function in (4) by a preliminary scale estimate. For
simplicity we use the MAD of the residuals of the initial estimator used in the coordinate
descent algorithm.
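The sensitivity curve definition translates directly into code. The sketch below evaluates it for the lasso in simple regression, where the estimator has a closed form, so no coordinate descent is needed; the function names and the toy sample are assumptions of this example and the paper's own computations are done differently (see above).

```python
def lasso_1d(x, y, lam):
    """Sample lasso for p = 1 with objective (1/n) * sum (y - x*b)^2 + 2*lam*|b|."""
    n = len(x)
    cov = sum(xi * yi for xi, yi in zip(x, y)) / n
    var = sum(xi * xi for xi in x) / n
    shrunk = max(abs(cov) - lam, 0.0)
    return shrunk / var if cov >= 0 else -shrunk / var


def sensitivity_curve(estimator, x, y, x0, y0):
    """(n + 1) times the change in the estimate when (x0, y0) is added to the sample."""
    n = len(x)
    return (n + 1) * (estimator(x + [x0], y + [y0]) - estimator(x, y))


def lasso(x, y):
    return lasso_1d(x, y, lam=0.1)


# Hypothetical sample on the line y = 1.5 x.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [1.5 * xi for xi in x]

sc_inlier = sensitivity_curve(lasso, x, y, 1.0, 1.5)    # new point follows the model
sc_outlier = sensitivity_curve(lasso, x, y, 1.0, 50.0)  # vertical outlier

# A sample whose lasso estimate is shrunk to zero stays at zero after adding a point.
y_small = [0.01 * xi for xi in x]
sc_sparse = sensitivity_curve(lasso, x, y_small, 1.0, 0.2)
```

The outlier produces a large value of the sensitivity curve, reflecting the unbounded influence function of the lasso, while the zero estimate is unaffected by a moderate additional point.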

Figures 8 and 9 show the sensitivity curves for estimators $\hat\beta$ with quadratic loss
function and the different penalties (least squares, ridge, lasso and SCAD) for parameters
$\beta_0 = 1.5$ and $\beta_0 = 0$, respectively. We can compare these figures to the theoretical
influence functions in Figures 5 and 6. Examining Figure 8, we see that for $\beta_0 = 1.5$,
the results match the theoretical ones. For $\beta_0 = 0$, see Figure 9, the sensitivity curve
is again comparable to the influence function. For the lasso and SCAD, small deviations from
the constantly zero sensitivity curve can be spotted in the left and right corner. This
indicates that the number of observations $n$ is too small to get the same results as at
population level for observations $(x_0, y_0)$ that lie far away from the model.

Figure 7: Influence functions for different loss functions (Huber, biweight, sparse LTS)
and $L_1$-penalty for $\beta_0 = 0$ and $\beta_0 = 1.5$ with $(x_0, y_0) \in [-10, 10]^2$ and the
vertical axis ranging from $-75$ to $40$

We also compare the results for estimators using different loss functions. To this end,
we look at sparse LTS and the $L_1$-penalized Huber- and biweight-M-estimators, as in
Section 7. Their sensitivity curves are plotted in Figure 10. They resemble the shape of
the influence functions in Figure 7.

To conclude, we may say that the sensitivity curves match the corresponding influence
functions.


9 Asymptotic Variance and Mean Squared Error 

We can also evaluate the performance of any functional T by the asymptotic variance, 
given by 


ASV (T, H) = n ■ lim Var T n , 

71 — 1-00 


where the estimator T n is the functional T evaluated at the empirical distribution. A 
heuristic formula to compute the asymptotic variance is given by 


ASV (T, H) = J IF((xo, m ),T,H) • TF((x 0 , y 0 ), T, ff)' eUL((x 0 , yo))- 


(25) 


For M-functionals with a smooth loss function $\rho$ and smooth penalty $J$, the theory of
M-estimators is applicable [e.g. Huber, 1981; Hayashi, 2000]. For the sparse LTS-estimator
a formal proof of the validity of (25) is more difficult and we only conjecture its validity.
For the unpenalized case a proof can be found in Hössjer [1994].

Using the influence function formulas of the earlier sections, the integral (25) can be
computed by Monte Carlo numerical integration. We present results for simple regression.
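As an illustration of this Monte Carlo step, the sketch below approximates (25) for the lasso in simple regression by averaging the squared influence function over draws from $H_0$. It is a minimal example under assumptions that are not spelled out here: $x \sim N(0,1)$, $e \sim N(0,1)$, no intercept, and the closed-form lasso influence function obtained in the proof of Lemma 5.1. The function names are hypothetical.

```python
import random

def lasso_if(x0, y0, beta0, lam, ex2=1.0):
    """Influence function of the simple-regression lasso functional at H_0
    (closed form from the proof of Lemma 5.1; ex2 stands for E[x^2])."""
    if abs(beta0) <= lam / ex2:
        return 0.0
    sign = 1.0 if beta0 > 0 else -1.0
    return x0 * (y0 - beta0 * x0) / ex2 - sign * lam * (ex2 - x0 * x0) / ex2**2

def asv_monte_carlo(beta0, lam, draws=200_000, seed=1):
    """Average IF^2 over draws (x0, y0) ~ H_0, assuming x ~ N(0,1), e ~ N(0,1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        x0 = rng.gauss(0.0, 1.0)
        y0 = beta0 * x0 + rng.gauss(0.0, 1.0)
        total += lasso_if(x0, y0, beta0, lam) ** 2
    return total / draws

asv_ls = asv_monte_carlo(beta0=1.5, lam=0.0)     # least squares as the lam = 0 case
asv_lasso = asv_monte_carlo(beta0=1.5, lam=0.5)  # lasso with a nonzero functional
asv_zero = asv_monte_carlo(beta0=1.5, lam=2.0)   # functional exactly zero
print(asv_ls, asv_lasso, asv_zero)
```

Integrating the squared influence function by hand under the same assumptions gives $1 + 2\lambda^2$ for $\lambda < |\beta_0|$ and $0$ otherwise, which the three printed values should approximate.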


Figure 11 shows the asymptotic variance of six different functionals (least squares,
lasso, ridge, biweight loss with $L_1$-penalty, Huber loss with $L_1$-penalty, sparse LTS) as
a function of $\lambda$ for $\beta_0 = 1.5$. As the asymptotic variance of least squares is constantly
one for any value of $\lambda$ and $\beta_0$, it is used as a reference point in all four panels. All sparse
functionals show a jump to zero in their asymptotic variance after having increased
quickly to their maximum. This is due to parameters being estimated as exactly zero for values
of $\lambda$ sufficiently large. In the upper left panel, the asymptotic variance of ridge is added.
It is smaller than the asymptotic variance of least squares and decreases monotonically
to zero. Generally, for the optimal $\lambda$, least squares has a high asymptotic variance and ridge a
reduced one. The smallest asymptotic variance can be achieved by the sparse functionals,
but they can also attain considerably high values for bad choices of $\lambda$. We omit the plots
for $\beta_0 = 0$ because the asymptotic variance of ridge behaves similarly as in Figure 11
and the asymptotic variance of the other, sparse functionals is constantly zero.


[Figure residue: caption fragment "-250 to 100"; ASV panels with recovered panel titles
"Lasso" and "Biweight + L1"]

Figure 12: Mean squared error of various functionals ($\lambda = 0.1$ fixed; recovered panel
title: $\beta_0 = 0.05$)


In general, robust functionals have a bias (see Section 3). Hence, considering only the
asymptotic variance is not sufficient to evaluate the precision of functionals. A more
informative measure is the Mean Squared Error (MSE), as it takes bias and variance into
account:

\[ \mathrm{MSE}(T, H) = \frac{1}{n} \mathrm{ASV}(T, H) + \mathrm{Bias}(T, H)\, \mathrm{Bias}(T, H)'. \tag{26} \]


Figure 12 displays the MSE as a function of $n$ for $\beta_0 = 0.05$ and $1.5$, with $\lambda = 0.1$ fixed. We
only present results for simple regression as they resemble the component-wise results in
multiple regression.


Looking at Figure 12, the MSE of least squares is the same in both panels as least
squares has no bias and its asymptotic variance does not depend on $\beta_0$. It decreases
monotonically from one to zero. The MSEs of the other functionals are also monotonically
decreasing, but towards their squared bias. For $\beta_0 = 0.05$, the MSE of ridge is slightly lower than
that of least squares. The MSEs of the sparse functionals are constant and equal to
their squared bias (i.e. $\beta_0^2$, as the estimate equals zero). For $\beta_0 = 1.5$, the MSE of biweight
is largest, the MSE of sparse LTS is slightly larger than that of ridge, and the MSEs of the lasso and
Huber are similar to that of least squares, which is the lowest. We again do not show results for
$\beta_0 = 0$ because then no functional has a bias, and we would only compare the asymptotic
variances.

We also show the match at population and sample level for the MSE. For any estimator
$\hat\beta$ computed for $r = 1, \ldots, R$ samples, an estimator for the mean squared error (26) is

\[ \mathrm{MSE}(\hat\beta) = \frac{1}{R} \sum_{r=1}^{R} (\hat\beta_r - \beta_0)^2. \]

Figure 13: Convergence of MSE($\hat\beta$) to MSE($\beta_0, H_0$) for different functionals with $\beta_0 = 0.05$
(panels: LS, Ridge, Lasso, Biweight + $L_1$, Huber + $L_1$, spLTS; horizontal axis: $n$;
recovered legend entry: sample level)


For the six functionals (least squares, ridge, lasso, biweight-M with $L_1$-penalty, Huber-M
with $L_1$-penalty and sparse LTS) used in this section, Figures 13 and 14 illustrate the
good convergence of $n \cdot \mathrm{MSE}(\hat\beta)$ to $n \cdot \mathrm{MSE}(\beta_0, H_0)$ for $\beta_0 = 0.05$ and $1.5$, respectively.
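The sample-level estimator of the MSE can be sketched as follows for least squares and the lasso in simple regression. This is a minimal illustration, not the simulation design of the paper: it assumes $x \sim N(0,1)$, $e \sim N(0,1)$, no intercept, and uses a population value $\mathrm{ASV} = 1 + 2\lambda^2$ with bias $-\lambda\,\mathrm{sign}(\beta_0)$ for the lasso, which was obtained here by hand from (25) and (27) under these assumptions rather than taken from the text. The function names are hypothetical.

```python
import random

def lasso_fit(xs, ys, lam):
    """Sample version of the simple-regression lasso (lam = 0 gives least squares)."""
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys)) / n
    sxx = sum(x * x for x in xs) / n
    shrunk = max(abs(sxy) - lam, 0.0) / sxx
    return shrunk if sxy >= 0 else -shrunk

def n_times_mse_hat(beta0, lam, n=200, reps=2000, seed=2):
    """n * (1/R) * sum_r (beta_hat_r - beta0)^2 over R = reps simulated samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        ys = [beta0 * x + rng.gauss(0.0, 1.0) for x in xs]
        total += (lasso_fit(xs, ys, lam) - beta0) ** 2
    return n * total / reps

def n_times_mse_population(beta0, lam, n):
    """n * MSE from (26) for the lasso functional, assuming E[x^2] = sigma^2 = 1."""
    if abs(beta0) <= lam:
        return n * beta0**2                      # functional is zero: ASV = 0, bias = -beta0
    return (1.0 + 2.0 * lam**2) + n * lam**2     # ASV + n * bias^2

ls_hat = n_times_mse_hat(beta0=1.5, lam=0.0)
lasso_hat = n_times_mse_hat(beta0=1.5, lam=0.1)
print(ls_hat, n_times_mse_population(1.5, 0.0, 200))
print(lasso_hat, n_times_mse_population(1.5, 0.1, 200))
```

For $n = 200$ the simulated and population values should already be close for $\beta_0 = 1.5$; for $\beta_0 = 0.05$ the sample estimate is not yet zero with high probability at this $n$, so a larger $n$ is needed before the two levels agree.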


10 Conclusion 


In this paper we computed influence functions of penalized regression estimators, more
precisely for penalized M-functionals. From the derivation of the influence function, we
concluded that only functionals with a bounded loss function (biweight, sparse LTS)
achieve robustness against leverage points, while a Huber loss can deal with vertical
outliers. Looking at the MSE, sparse LTS is preferred in case of bad leverage points and
the $L_1$-penalized Huber M-estimator in case there are only vertical outliers.

Apart from considering the influence function, a suitable estimator is often also chosen
with respect to its breakdown point [see for example Maronna et al., 2006]. This second
important property in robust analysis gives the maximum fraction of outliers that a
method can deal with. While it has already been computed for sparse LTS [Alfons et al.,
2013], it would also be worth deriving it for the other robust penalized M-functionals
mentioned in this paper.

Figure 14: Convergence of MSE($\hat\beta$) to MSE($\beta_0, H_0$) for different functionals with $\beta_0 = 1.5$

Like any study, this one is subject to some limitations. First of all, we assumed
in our derivations the penalty parameter $\lambda$ to be fixed. However, in practice it is often
chosen with a data-driven approach. Thus, contamination in the data might also have
an effect on the estimation through the choice of the penalty parameter. Investigation
of this effect is left for further research.

Another limitation is that the values of the tuning constants in the loss functions of the
M-estimators were selected to achieve a given efficiency in the unpenalized case. One
could imagine selecting the parameter $\lambda$ simultaneously with the other tuning constants.

Finally, in the theoretical derivations (but not at the sample level) we implicitly assume
the scale of the error terms to be fixed, in order to keep the calculations feasible. While
the results obtained for the lasso, the ridge and the sparse LTS functional do not rely on
that assumption, the results for biweight and Huber loss do.

APPENDIX - Proofs


Proof of Equation (13). Recall that we are in the case $p = 1$. For any joint distribution
$(x, y) \sim H$ with $\beta_{LASSO}(H) \neq 0$, minimizing the lasso objective function and solving
the resulting first-order condition (FOC) for $\beta_{LASSO}(H)$ yields

\[ \beta_{LASSO}(H) = \beta_{LS}(H) - \frac{\lambda}{E_H[x^2]} \, \mathrm{sign}(\beta_{LASSO}(H)). \tag{27} \]

We will now consider two different cases. First we consider the case that the lasso
functional is not zero at distribution $H$. We will show that it then always has to
have the same sign as the least squares functional $\beta_{LS}(H)$. We start by assuming
$\mathrm{sign}(\beta_{LASSO}(H)) \neq \mathrm{sign}(\beta_{LS}(H))$ and show that this leads to a contradiction. In
this case $\beta_{LS}(H) = 0$ is not possible for the following reason. If $\beta_{LS}(H) = 0$, then
$\beta = 0$ minimizes the residual sum of squares. Furthermore, the minimum of the penalty
function is attained at $\beta = 0$. Hence, $\beta = 0$ would not only minimize the residual sum
of squares, but also the penalized objective function, if $\beta_{LS}(H) = 0$. Hence, the lasso
functional would also be zero, which we do not consider in this first case. Thus, take
$\beta_{LS}(H) > 0$. From our assumption it would follow that $\mathrm{sign}(\beta_{LASSO}(H)) = -1$ (as
$\beta_{LASSO}(H) = 0$ is considered only in the next paragraph) and together with the FOC
this would yield the contradiction $0 > \beta_{LASSO}(H) = \beta_{LS}(H) + \lambda / E_H[x^2] > \beta_{LS}(H) > 0$.
The case $\beta_{LS}(H) < 0$ is analogous. Hence, for $\beta_{LASSO}(H) \neq 0$ the signs of the lasso and the least
squares functional are always equal.

Let us now consider the case where the lasso functional is zero at the distribution $H$.
The FOC then makes use of the concept of subdifferentials [Bertsekas, 1995] and can
be written as $|\beta_{LS}(H)| \le \lambda / E_H[x^2]$. On the other hand, if $|\beta_{LS}(H)| \le \lambda / E_H[x^2]$,
assuming $\beta_{LASSO}(H) \neq 0$ leads to a contradiction since Equation (27) would imply that
$\mathrm{sign}(\beta_{LASSO}(H)) = -\mathrm{sign}(\beta_{LASSO}(H))$. Thus, the lasso functional equals zero if and
only if $|\beta_{LS}(H)| \le \lambda / E_H[x^2]$. Therefore the lasso functional for simple regression is
(13). □

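The two cases of this proof combine into a soft-thresholding rule, which can be checked numerically. The sketch below is only an illustration with hypothetical function names: it writes the penalized objective $E_H[(y - x\beta)^2] + 2\lambda|\beta|$ in terms of assumed second moments and compares the closed form with a brute-force grid search.

```python
def lasso_functional(exy, ex2, lam):
    """Closed form for p = 1: sign(E[xy]) * (|E[xy]| - lam)_+ / E[x^2]."""
    shrunk = max(abs(exy) - lam, 0.0) / ex2
    return shrunk if exy >= 0 else -shrunk

def penalized_objective(beta, exy, ex2, ey2, lam):
    """E[(y - x*beta)^2] + 2*lam*|beta|, expressed through second moments."""
    return ey2 - 2.0 * beta * exy + beta * beta * ex2 + 2.0 * lam * abs(beta)

def grid_minimizer(exy, ex2, ey2, lam, lo=-3.0, hi=3.0, steps=6000):
    """Brute-force minimizer on an equally spaced grid (step 0.001 by default)."""
    grid = [lo + i * (hi - lo) / steps for i in range(steps + 1)]
    return min(grid, key=lambda b: penalized_objective(b, exy, ex2, ey2, lam))

# Assumed moments for y = 1.5*x + e with E[x^2] = 1 and Var(e) = 1.
print(lasso_functional(1.5, 1.0, 0.1), grid_minimizer(1.5, 1.0, 3.25, 0.1))
```

With $|\beta_{LS}(H)| \le \lambda / E_H[x^2]$ the same functions return zero, matching the second case of the proof.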

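Proof of Lemma 3.1. As $x \sim N(0, \Sigma)$ and $e \sim N(0, \sigma^2)$ are independent, $y - x\beta$ is
normally distributed, $y - x\beta \sim N(0, \sigma^2 + (\beta_0 - \beta)^2 \Sigma)$ for any $\beta \in \mathbb{R}$. Defining
$\sigma^2(\beta) := \sigma^2 + (\beta_0 - \beta)^2 \Sigma$, we find $q_\beta = \Phi^{-1}\left(\frac{\alpha + 1}{2}\right) \sigma(\beta)$. We also introduce
$q_\alpha = \Phi^{-1}\left(\frac{\alpha + 1}{2}\right)$. With this we can rewrite the expected value of the objective function:

\[
\begin{aligned}
E_{H_0}\left[(y - x\beta)^2 I_{[|y - x\beta| \le q_\beta]}\right]
&= \sigma^2(\beta)\, E_{H_0}\left[\frac{(y - x\beta)^2}{\sigma^2(\beta)} I_{\left[\frac{|y - x\beta|}{\sigma(\beta)} \le q_\alpha\right]}\right] \\
&= \sigma^2(\beta)\, E_Z\left[Z^2 I_{[|Z| \le q_\alpha]}\right] \quad \text{with } Z \sim N(0, 1) \\
&= \sigma^2(\beta)\left(-2 q_\alpha \phi(q_\alpha) + \alpha\right).
\end{aligned}
\tag{28}
\]

Denoting $c_1 := \alpha - 2 q_\alpha \phi(q_\alpha)$, we can say that

\[ \beta_{spLTS}(H_0) = \arg\min_{\beta \in \mathbb{R}} \; c_1 \sigma^2(\beta) + \alpha\lambda|\beta|. \]

Separating into $\beta > 0$ and $\beta < 0$, differentiating w.r.t. $\beta$ and setting the result to 0 gives
Equation (15). □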
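The last step of (28), the truncated second moment of a standard normal variable, can be verified numerically. The sketch below is an illustration with hypothetical function names; it compares the closed form $c_1 = \alpha - 2 q_\alpha \phi(q_\alpha)$ with a midpoint-rule approximation of $\int_{-q_\alpha}^{q_\alpha} z^2 \phi(z)\,dz$.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def trimmed_second_moment(alpha):
    """c_1 = alpha - 2*q_alpha*phi(q_alpha) with q_alpha = Phi^{-1}((alpha + 1)/2)."""
    q = STD_NORMAL.inv_cdf((alpha + 1.0) / 2.0)
    return alpha - 2.0 * q * STD_NORMAL.pdf(q)

def trimmed_second_moment_numeric(alpha, steps=20_000):
    """Midpoint rule for the integral of z^2 * phi(z) over [-q_alpha, q_alpha]."""
    q = STD_NORMAL.inv_cdf((alpha + 1.0) / 2.0)
    h = 2.0 * q / steps
    total = 0.0
    for i in range(steps):
        z = -q + (i + 0.5) * h
        total += z * z * STD_NORMAL.pdf(z)
    return total * h

print(trimmed_second_moment(0.75), trimmed_second_moment_numeric(0.75))
```

The value $\alpha = 0.75$ is only an example choice of the trimming proportion.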


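Proof of Proposition 4.1. The objective function (6) is minimized by solving the first-order
condition (FOC), i.e. by setting the derivative of the objective function to zero. At the
contaminated model with distribution $H_\varepsilon := (1 - \varepsilon) H_0 + \varepsilon \delta_{(x_0, y_0)}$ this yields

\[ -E_{H_\varepsilon}\left[\psi(y - x'\beta_M(H_\varepsilon))\, x\right] + 2\lambda J'(\beta_M(H_\varepsilon)) = 0. \]

Here $J'(\beta_M(H_\varepsilon))$ is used as an abbreviation for $(J'(\beta_1(H_\varepsilon)), \ldots, J'(\beta_p(H_\varepsilon)))'$ and $\delta_{(x_0, y_0)}$
denotes the point mass distribution at $(x_0, y_0)$.

Using the definition of the contaminated distribution $H_\varepsilon$, the FOC becomes

\[ -(1 - \varepsilon) E_{H_0}\left[\psi(y - x'\beta_M(H_\varepsilon))\, x\right] - \varepsilon\, \psi(y_0 - x_0'\beta_M(H_\varepsilon))\, x_0 + 2\lambda J'(\beta_M(H_\varepsilon)) = 0. \]

Differentiating with respect to $\varepsilon$ yields

\[
\begin{aligned}
& E_{H_0}\left[\psi(y - x'\beta_M(H_\varepsilon))\, x\right]
 - (1 - \varepsilon) E_{H_0}\left[\psi'(y - x'\beta_M(H_\varepsilon))\, x \left(-x' \frac{\partial}{\partial\varepsilon}\beta_M(H_\varepsilon)\right)\right] \\
& \quad - \psi(y_0 - x_0'\beta_M(H_\varepsilon))\, x_0
 - \varepsilon\, \psi'(y_0 - x_0'\beta_M(H_\varepsilon))\, x_0 \left(-x_0' \frac{\partial}{\partial\varepsilon}\beta_M(H_\varepsilon)\right) \\
& \quad + 2\lambda\, \mathrm{diag}(J''(\beta_M(H_\varepsilon)))\, \frac{\partial}{\partial\varepsilon}\beta_M(H_\varepsilon) = 0,
\end{aligned}
\]

where $\mathrm{diag}(J''(\beta_M(H_\varepsilon)))$ denotes the diagonal matrix with entries
$(J''((\beta_M(H_\varepsilon))_1), \ldots, J''((\beta_M(H_\varepsilon))_p))$ on the main diagonal.

Since $\frac{\partial}{\partial\varepsilon}[\beta_M(H_\varepsilon)]\big|_{\varepsilon = 0} = \mathrm{IF}((x_0, y_0), \beta_M, H_0)$, setting $\varepsilon = 0$ gives

\begin{align}
& E_{H_0}\left[\psi(y - x'\beta_M(H_0))\, x\right] + E_{H_0}\left[\psi'(y - x'\beta_M(H_0))\, x x'\right] \cdot \mathrm{IF}((x_0, y_0), \beta_M, H_0) \tag{29} \\
& \quad - \psi(y_0 - x_0'\beta_M(H_0))\, x_0 + 2\lambda\, \mathrm{diag}(J''(\beta_M(H_0))) \cdot \mathrm{IF}((x_0, y_0), \beta_M, H_0) = 0. \tag{30}
\end{align}

Solving (30) for $\mathrm{IF}((x_0, y_0), \beta_M, H_0)$ gives Equation (16). □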
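To see the solved form of (30) at work, the following sketch treats ridge regression with $p = 1$ as an example. It assumes quadratic loss $\rho(r) = r^2$ (so $\psi(r) = 2r$, $\psi' = 2$) and the penalty $J(\beta) = \beta^2$ with multiplier $2\lambda$; these choices and all function names are assumptions made for the illustration. The analytic influence function is compared with a finite-difference derivative of the functional along the contamination path.

```python
def ridge_functional(exy, ex2, lam):
    """argmin_b E[(y - x*b)^2] + 2*lam*b^2 for p = 1, in terms of E[xy] and E[x^2]."""
    return exy / (ex2 + 2.0 * lam)

def ridge_if(x0, y0, exy, ex2, lam):
    """Solve (30) with psi(r) = 2r, psi'(r) = 2, J''(b) = 2; the common factor 2 cancels."""
    beta = ridge_functional(exy, ex2, lam)
    score_mean = exy - beta * ex2                 # E[(y - x*beta) * x]
    return ((y0 - x0 * beta) * x0 - score_mean) / (ex2 + 2.0 * lam)

def ridge_if_numeric(x0, y0, exy, ex2, lam, eps=1e-6):
    """Finite-difference derivative along H_eps = (1 - eps)*H_0 + eps*delta_(x0, y0)."""
    exy_eps = (1.0 - eps) * exy + eps * x0 * y0
    ex2_eps = (1.0 - eps) * ex2 + eps * x0 * x0
    return (ridge_functional(exy_eps, ex2_eps, lam) - ridge_functional(exy, ex2, lam)) / eps

print(ridge_if(2.0, 10.0, 1.5, 1.0, 0.5), ridge_if_numeric(2.0, 10.0, 1.5, 1.0, 0.5))
```

Because the second derivative of the ridge penalty is constant, the denominator does not depend on the contamination point, while the numerator grows without bound in $(x_0, y_0)$.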


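Proof of Lemma 5.1. Using the explicit definition of the lasso functional (13), its influence
function can be computed directly. Thus, we differentiate the functional at the
contaminated model $H_\varepsilon = (1 - \varepsilon) H_0 + \varepsilon \delta_{(x_0, y_0)}$ with respect to $\varepsilon$ and take the limit of $\varepsilon$
approaching 0:

\[
\begin{aligned}
\mathrm{IF}((x_0, y_0), \beta_{LASSO}, H_0)
&= \frac{\partial}{\partial\varepsilon}\left[ \mathrm{sign}\big((1 - \varepsilon) E_{H_0}[xy] + \varepsilon x_0 y_0\big)
\left( \frac{|(1 - \varepsilon) E_{H_0}[xy] + \varepsilon x_0 y_0|}{(1 - \varepsilon) E_{H_0}[x^2] + \varepsilon x_0^2}
 - \frac{\lambda}{(1 - \varepsilon) E_{H_0}[x^2] + \varepsilon x_0^2} \right)_+ \right]_{\varepsilon = 0} \\
&= \frac{\partial}{\partial\varepsilon}\Big[ \mathrm{sign}\big((1 - \varepsilon) E_{H_0}[xy] + \varepsilon x_0 y_0\big) \Big]_{\varepsilon = 0}
\left( \frac{|E_{H_0}[xy]|}{E_{H_0}[x^2]} - \frac{\lambda}{E_{H_0}[x^2]} \right)_+ \\
&\quad + \mathrm{sign}(E_{H_0}[xy]) \, \frac{\partial}{\partial\varepsilon}\left[
\left( \frac{|(1 - \varepsilon) E_{H_0}[xy] + \varepsilon x_0 y_0|}{(1 - \varepsilon) E_{H_0}[x^2] + \varepsilon x_0^2}
 - \frac{\lambda}{(1 - \varepsilon) E_{H_0}[x^2] + \varepsilon x_0^2} \right)_+ \right]_{\varepsilon = 0}.
\end{aligned}
\]

While the derivative in the first summand equals zero almost everywhere, the derivative
occurring in the second summand has to consider two cases separately. Using the fact
that $E_{H_0}[xy] / E_{H_0}[x^2] = \beta_{LS}(H_0) = \beta_0$, we get

\[
\begin{aligned}
& \frac{\partial}{\partial\varepsilon}\left[
\left( \frac{|(1 - \varepsilon) E_{H_0}[xy] + \varepsilon x_0 y_0|}{(1 - \varepsilon) E_{H_0}[x^2] + \varepsilon x_0^2}
 - \frac{\lambda}{(1 - \varepsilon) E_{H_0}[x^2] + \varepsilon x_0^2} \right)_+ \right]_{\varepsilon = 0} \\
&= \begin{cases}
0 & \text{if } -\frac{\lambda}{E_{H_0}[x^2]} \le \beta_0 \le \frac{\lambda}{E_{H_0}[x^2]}, \\[4pt]
\mathrm{sign}(E_{H_0}[xy]) \, \dfrac{(-E_{H_0}[xy] + x_0 y_0) E_{H_0}[x^2] - E_{H_0}[xy] (-E_{H_0}[x^2] + x_0^2)}{(E_{H_0}[x^2])^2}
 + \dfrac{\lambda (-E_{H_0}[x^2] + x_0^2)}{(E_{H_0}[x^2])^2} & \text{otherwise}
\end{cases} \\
&= \begin{cases}
0 & \text{if } -\frac{\lambda}{E_{H_0}[x^2]} \le \beta_0 \le \frac{\lambda}{E_{H_0}[x^2]}, \\[4pt]
\mathrm{sign}(\beta_0) \, \dfrac{x_0 (y_0 - \beta_0 x_0)}{E_{H_0}[x^2]}
 - \dfrac{\lambda (E_{H_0}[x^2] - x_0^2)}{(E_{H_0}[x^2])^2} & \text{otherwise.}
\end{cases}
\end{aligned}
\]

Thus, almost everywhere the influence function equals (18). □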
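The result of this proof can be checked against a finite-difference derivative of the explicit lasso functional. The sketch below is only an illustration with hypothetical function names; the moments $E_{H_0}[xy]$ and $E_{H_0}[x^2]$ are passed in as assumed numbers.

```python
def lasso_functional(exy, ex2, lam):
    """sign(E[xy]) * (|E[xy]|/E[x^2] - lam/E[x^2])_+ for simple regression."""
    shrunk = max(abs(exy) - lam, 0.0) / ex2
    return shrunk if exy >= 0 else -shrunk

def lasso_if(x0, y0, exy, ex2, lam):
    """Closed form: the case distinction above multiplied by sign(E[xy])."""
    beta_ls = exy / ex2
    if abs(beta_ls) <= lam / ex2:
        return 0.0
    sign = 1.0 if beta_ls > 0 else -1.0
    return x0 * (y0 - beta_ls * x0) / ex2 - sign * lam * (ex2 - x0 * x0) / ex2**2

def lasso_if_numeric(x0, y0, exy, ex2, lam, eps=1e-7):
    """Finite-difference derivative along H_eps = (1 - eps)*H_0 + eps*delta_(x0, y0)."""
    exy_eps = (1.0 - eps) * exy + eps * x0 * y0
    ex2_eps = (1.0 - eps) * ex2 + eps * x0 * x0
    return (lasso_functional(exy_eps, ex2_eps, lam) - lasso_functional(exy, ex2, lam)) / eps

print(lasso_if(2.0, 10.0, 1.5, 1.0, 0.1), lasso_if_numeric(2.0, 10.0, 1.5, 1.0, 0.1))
```

The term $x_0 (y_0 - \beta_0 x_0)$ is unbounded in both coordinates, which is the reason the lasso is not robust to either vertical outliers or leverage points.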


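Proof of Lemma 5.2. Differentiating the lasso functional of the coordinate descent algorithm

\[ \beta_j^{cd}(H) = \mathrm{sign}\left(E_H\left[x_j (y - x^{(j)\prime}\beta^{*(j)})\right]\right)
\left( \frac{\left|E_H\left[x_j (y - x^{(j)\prime}\beta^{*(j)})\right]\right|}{E_H[x_j^2]} - \frac{\lambda}{E_H[x_j^2]} \right)_+ \]

for the contaminated model $(x, y) \sim H_\varepsilon = (1 - \varepsilon) H_0 + \varepsilon \delta_{(x_0, y_0)}$ yields

\[
\begin{aligned}
\mathrm{IF}((x_0, y_0), \beta_j^{cd}, H_0)
&= \frac{\partial}{\partial\varepsilon}\left[ \mathrm{sign}\left(E_{H_\varepsilon}\left[x_j (y - x^{(j)\prime}\beta^{*(j)}(\varepsilon))\right]\right) \right]_{\varepsilon = 0}
\left( \frac{\left|E_{H_0}\left[x_j (y - x^{(j)\prime}\beta^{*(j)})\right]\right|}{E_{H_0}[x_j^2]} - \frac{\lambda}{E_{H_0}[x_j^2]} \right)_+ \\
&\quad + \mathrm{sign}\left(E_{H_0}\left[x_j (y - x^{(j)\prime}\beta^{*(j)})\right]\right)
\frac{\partial}{\partial\varepsilon}\left[ \left( \frac{\left|E_{H_\varepsilon}\left[x_j (y - x^{(j)\prime}\beta^{*(j)}(\varepsilon))\right]\right|}{E_{H_\varepsilon}[x_j^2]} - \frac{\lambda}{E_{H_\varepsilon}[x_j^2]} \right)_+ \right]_{\varepsilon = 0}.
\end{aligned}
\tag{31}
\]

Note that the fixed values $\beta^*(\varepsilon)$ depend on $\varepsilon$, as they may depend on the data, e.g. if
they are the values of a previous coordinate descent loop. $\beta^{*(j)}$ is used as an abbreviation
for $\beta^{*(j)}(0)$ and $\mathrm{IF}((x_0, y_0), \beta^{*(j)}, H_0)$ is shortened to $\mathrm{IF}(\beta^{*(j)})$.

The derivative of the sign function equals zero almost everywhere. For the derivative
of the positive part function two different cases have to be considered:

\[
\begin{aligned}
& \frac{\partial}{\partial\varepsilon}\left[ \left(
\frac{\left|(1 - \varepsilon) E_{H_0}\left[x_j (y - x^{(j)\prime}\beta^{*(j)}(\varepsilon))\right] + \varepsilon (x_0)_j (y_0 - x_0^{(j)\prime}\beta^{*(j)}(\varepsilon))\right|}{(1 - \varepsilon) E_{H_0}[x_j^2] + \varepsilon (x_0)_j^2}
 - \frac{\lambda}{(1 - \varepsilon) E_{H_0}[x_j^2] + \varepsilon (x_0)_j^2} \right)_+ \right]_{\varepsilon = 0} \\
&= \begin{cases}
0 & \text{if } \dfrac{\left|E_{H_0}[x_j (y - x^{(j)\prime}\beta^{*(j)})]\right|}{E_{H_0}[x_j^2]} \le \dfrac{\lambda}{E_{H_0}[x_j^2]}, \\[8pt]
\mathrm{sign}\left(E_{H_0}[x_j (y - x^{(j)\prime}\beta^{*(j)})]\right)
\dfrac{\left(-E_{H_0}[x_j (y - x^{(j)\prime}\beta^{*(j)})] - E_{H_0}[x_j x^{(j)\prime}]\, \mathrm{IF}(\beta^{*(j)}) + (x_0)_j (y_0 - x_0^{(j)\prime}\beta^{*(j)})\right) E_{H_0}[x_j^2]
 - E_{H_0}[x_j (y - x^{(j)\prime}\beta^{*(j)})] \left(-E_{H_0}[x_j^2] + (x_0)_j^2\right)}{(E_{H_0}[x_j^2])^2} \\
\quad + \dfrac{\lambda \left(-E_{H_0}[x_j^2] + (x_0)_j^2\right)}{(E_{H_0}[x_j^2])^2} & \text{otherwise}
\end{cases} \\
&= \begin{cases}
0 & \text{if } \left|E_{H_0}[x_j (y - x^{(j)\prime}\beta^{*(j)})]\right| \le \lambda, \\[8pt]
\mathrm{sign}\left(E_{H_0}[x_j (y - x^{(j)\prime}\beta^{*(j)})]\right)
\left( \dfrac{-E_{H_0}[x_j x^{(j)\prime}]\, \mathrm{IF}(\beta^{*(j)}) + (x_0)_j (y_0 - x_0^{(j)\prime}\beta^{*(j)})}{E_{H_0}[x_j^2]}
 - \dfrac{E_{H_0}[x_j (y - x^{(j)\prime}\beta^{*(j)})]}{(E_{H_0}[x_j^2])^2} (x_0)_j^2 \right) \\
\quad - \dfrac{\lambda \left(E_{H_0}[x_j^2] - (x_0)_j^2\right)}{(E_{H_0}[x_j^2])^2} & \text{otherwise.}
\end{cases}
\end{aligned}
\tag{32}
\]

Using the result of Equation (32) in (31) and denoting $\tilde{y}^{(j)} := y - x^{(j)\prime}\beta^{*(j)}$ yields
influence function (20). □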

Proof of Proposition 5.3. W.l.o.g. $\beta_{LASSO} = (\beta, 0, \ldots, 0)'$ with $\beta \in \mathbb{R}^k$ and $\beta_j \neq 0 \;\forall j =
1, \ldots, k$. At first, we only consider variables $j = 1, \ldots, k$. For them, the first-order
condition (FOC) for finding the minimum of (7) yields

\[ \left(-2 E_H\left[x (y - x'\beta_{LASSO}(H))\right] + 2\lambda\, \mathrm{sign}(\beta_{LASSO}(H))\right)_j = 0, \qquad j = 1, \ldots, k. \]

Let $(x, y) \sim H_0$ denote the model distribution and $H_\varepsilon$ the contaminated distribution.
Then the FOC at the contaminated model is

\[ -(1 - \varepsilon) E_{H_0}\left[x_j (y - x'\beta_{LASSO}(H_\varepsilon))\right] - \varepsilon (x_0)_j (y_0 - x_0'\beta_{LASSO}(H_\varepsilon)) + \lambda\, \mathrm{sign}((\beta_{LASSO}(H_\varepsilon))_j) = 0. \]

After differentiating with respect to $\varepsilon$, we get

\[
\begin{aligned}
& E_{H_0}\left[x_j (y - x'\beta_{LASSO}(H_\varepsilon))\right] + (1 - \varepsilon) E_{H_0}[x_j x'] \frac{\partial \beta_{LASSO}(H_\varepsilon)}{\partial\varepsilon} \\
& \quad - (x_0)_j (y_0 - x_0'\beta_{LASSO}(H_\varepsilon)) + \varepsilon (x_0)_j x_0' \frac{\partial \beta_{LASSO}(H_\varepsilon)}{\partial\varepsilon} = 0.
\end{aligned}
\]

Taking the limit as $\varepsilon$ approaches 0 gives an implicit definition of the influence function
for $j = 1, \ldots, k$:

\[ E_{H_0}[x_j x'] \cdot \mathrm{IF}((x_0, y_0), \beta_{LASSO}, H_0) = (x_0)_j (y_0 - x_0'\beta_{LASSO}(H_0)) - E_{H_0}\left[x_j (y - x'\beta_{LASSO}(H_0))\right]. \tag{33} \]

For variables $j = k + 1, \ldots, p$ with $(\beta_{LASSO})_j = 0$, we need to use subgradients [Bertsekas,
1995] to get the FOC

\[ 0 \in -E_H\left[x (y - x'\beta_{LASSO}(H))\right] + \lambda \cdot \partial\left(\|\beta_{LASSO}(H)\|_1\right). \]

Observing each variable individually yields

\[ \left| E_H\left[x_j (y - x'\beta_{LASSO}(H))\right] \right| \le \lambda. \tag{34} \]

The coordinate descent algorithm converges for any starting value $\beta^*$ to $\beta_{LASSO}$ [Friedman
et al., 2007; Tseng, 2001], i.e. after enough updates $\beta^* \approx \beta_{LASSO}$. Thus, for
$(\beta_{LASSO}(H_0))_j = 0$ and $(x, y) \sim H_0$, Equation (34) yields

\[ \left| E_{H_0}\left[x_j (y - x^{(j)\prime}\beta^{*(j)})\right] \right| \le \lambda. \]

Lemma 5.2 then tells us that

\[ \mathrm{IF}((x_0, y_0), (\beta_{LASSO})_j, H_0) = 0 \qquad \forall j = k + 1, \ldots, p. \]

With this we can rewrite Equation (33) as

\[ E_{H_0}[x_{1:k} x_{1:k}'] \cdot \mathrm{IF}((x_0, y_0), (\beta_{LASSO})_{1:k}, H_0) = (x_0)_{1:k} (y_0 - x_0'\beta_{LASSO}(H_0)) - E_{H_0}\left[x_{1:k} (y - x'\beta_{LASSO}(H_0))\right]. \]

Multiplying with $E_{H_0}[x_{1:k} x_{1:k}']^{-1}$ from the left side, we get the influence function of the
lasso functional (21). □

Proof of Lemma 5.4. We apply Proposition 4.1 with a quadratic loss function and use
the second derivative of the penalty function $J_K$

\[ J_K''((\beta_K)_j) = \begin{cases} J_K''((\beta_K)_j) =: a_j & j = 1, \ldots, k, \\ 2K & j = k + 1, \ldots, p. \end{cases} \]

W.l.o.g. we take $a = 1$. This gives the influence function of $\beta_K(H_0)$

\[
\begin{aligned}
\mathrm{IF}((x_0, y_0), \beta_K, H_0) = {} & \left(E_{H_0}[xx'] + \lambda\, \mathrm{diag}\left(J_K''((\beta_K)_1), \ldots, J_K''((\beta_K)_k), 2K, \ldots, 2K\right)\right)^{-1} \\
& \cdot \left((y_0 - x_0'\beta_K(H_0))\, x_0 - E_{H_0}\left[(y - x'\beta_K(H_0))\, x\right]\right).
\end{aligned}
\]

The covariance matrix $E_{H_0}[xx']$ can be denoted as a block matrix

\[ E_{H_0}[xx'] = \begin{pmatrix} E_{11} & E_{12} \\ E_{21} & E_{22} \end{pmatrix}. \]

The inverse matrix needed in the influence function is then

\[
\left(E_{H_0}[xx'] + \lambda\, \mathrm{diag}\left(J_K''((\beta_K)_1), \ldots, J_K''((\beta_K)_k), 2K, \ldots, 2K\right)\right)^{-1}
= \begin{pmatrix} E_{11} + \lambda\, \mathrm{diag}(J_K''((\beta_K)_{1:k})) & E_{12} \\ E_{21} & E_{22} + 2\lambda K I_{p-k} \end{pmatrix}^{-1}.
\tag{35}
\]

The inverse of the block matrix can be computed as

\[
\left(E_{H_0}[xx'] + \lambda\, \mathrm{diag}(0, \ldots, 0, 2K, \ldots, 2K)\right)^{-1}
= \begin{pmatrix} A^{-1} + A^{-1} E_{12} C^{-1} E_{21} A^{-1} & -A^{-1} E_{12} C^{-1} \\ -C^{-1} E_{21} A^{-1} & C^{-1} \end{pmatrix}
\]

with $C = E_{22} + 2\lambda K I_{p-k} - E_{21} A^{-1} E_{12}$ and $A = E_{11} + \lambda\, \mathrm{diag}(J_K''((\beta_K)_{1:k}))$ [see Magnus
and Neudecker, 2002, p. 11].

We denote the eigenvalues of the matrix $D = E_{22} - E_{21} E_{11}^{-1} E_{12}$ by $\nu_1, \ldots, \nu_{p-k}$. Then the
eigenvalues of the symmetric positive definite matrix $C$ are $\nu_1 + 2\lambda K, \ldots, \nu_{p-k} + 2\lambda K$.
If $K$ approaches infinity, these eigenvalues also tend to infinity. Hence, all eigenvalues
of $C^{-1}$ converge to zero. Thus, $C^{-1}$ becomes the zero matrix and therefore the inverse
matrix in (35) converges to

\[
\lim_{K \to \infty} \left(E_{H_0}[xx'] + \lambda\, \mathrm{diag}(0, \ldots, 0, 2K, \ldots, 2K)\right)^{-1}
= \begin{pmatrix} E_{11}^{-1} & 0 \\ 0 & 0 \end{pmatrix}.
\]

This gives the influence function of the lasso functional (21) as the limit of $\mathrm{IF}((x_0, y_0), \beta_K, H_0)$
for $K \to \infty$. □
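The limit of the penalized inverse can be illustrated in the smallest nontrivial case $p = 2$, $k = 1$, where all blocks are scalars. The sketch below is only an illustration: the second-moment matrix, the value of $\lambda$ and the function names are assumed for the example.

```python
def inverse_2x2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def penalized_inverse(e, lam, k_const):
    """(E + lam * diag(0, 2K))^{-1} for a 2x2 second-moment matrix E."""
    return inverse_2x2([[e[0][0], e[0][1]],
                        [e[1][0], e[1][1] + 2.0 * lam * k_const]])

E = [[2.0, 0.5], [0.5, 1.0]]   # assumed second-moment matrix E[xx']
for k_const in (1.0, 1e3, 1e9):
    print(k_const, penalized_inverse(E, lam=0.1, k_const=k_const))
```

As $K$ grows, the printed matrices approach the matrix with $E_{11}^{-1} = 0.5$ in the upper left corner and zeros elsewhere, in line with the limit used in the proof.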


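Proof of Lemma 6.1. As the sparse LTS functional is continuous, the influence function
of the sparse LTS functional equals 0 if $\beta_{spLTS}(H_0) = 0$. Thus, assume from now on
$\beta_{spLTS}(H_0) \neq 0$.

The first-order condition at the contaminated model $H_\varepsilon = (1 - \varepsilon) H_0 + \varepsilon \delta_{(x_0, y_0)}$ yields

\[ 0 = \frac{\partial}{\partial\beta}\left[ \int_{-q_{\varepsilon,\beta}}^{q_{\varepsilon,\beta}} u^2 \, dH_\varepsilon^\beta(u) \right] + \alpha\lambda\, \mathrm{sign}(\beta) =: \Psi(\varepsilon, \beta). \tag{36} \]

Note that here the quantile $q_{\varepsilon,\beta}$ as well as the distribution $H_\varepsilon^\beta$, induced by the joint model
distribution of $x$ and $y$, depend on $\beta$. We denote the solution of (36) by $\beta_\varepsilon := \beta_{spLTS}(H_\varepsilon)$ for
$\varepsilon \neq 0$ and $\beta_{spLTS}(H_0)$ otherwise.

As (36) is true for all $\varepsilon \in \mathbb{R}_+$, the chain rule gives

\[
\begin{aligned}
0 &= \frac{\partial}{\partial\varepsilon}\left[\Psi(\varepsilon, \beta_\varepsilon)\right]\Big|_{\varepsilon = 0}
 = \Psi_1(0, \beta_{spLTS}(H_0)) + \Psi_2(0, \beta_{spLTS}(H_0))\, \mathrm{IF}(\beta_{spLTS}), \\
\mathrm{IF}(\beta_{spLTS}) &= -\left[\Psi_2(0, \beta_{spLTS}(H_0))\right]^{-1} \Psi_1(0, \beta_{spLTS}(H_0)),
\end{aligned}
\tag{37}
\]

where $\Psi_1(a, b) = \frac{\partial}{\partial\varepsilon}\Psi(\varepsilon, b)\big|_{\varepsilon = a}$ and $\Psi_2(a, b) = \frac{\partial}{\partial\beta}\Psi(a, \beta)\big|_{\beta = b}$.

Before computing $\Psi_1(0, \beta_{spLTS}(H_0))$ and $\Psi_2(0, \beta_{spLTS}(H_0))$, we can simplify $\Psi(\varepsilon, \beta)$
by using $H_0^\beta = N(0, \sigma^2(\beta))$ with $\sigma^2(\beta) = \sigma^2 + (\beta_0 - \beta)^2 \Sigma$, as $x \sim N(0, \Sigma)$ and
$e \sim N(0, \sigma^2)$:

\[
\begin{aligned}
\Psi(\varepsilon, \beta)
&= \frac{\partial}{\partial\beta}\left( (1 - \varepsilon) \int_{-q_{\varepsilon,\beta}}^{q_{\varepsilon,\beta}} u^2 \, dH_0^\beta(u)
 + \varepsilon I_{[|y_0 - x_0\beta| \le q_{\varepsilon,\beta}]} (y_0 - x_0\beta)^2 \right) + \alpha\lambda\, \mathrm{sign}(\beta) \\
&= (1 - \varepsilon) \frac{\partial}{\partial\beta}\left[ \int_{-q_{\varepsilon,\beta}}^{q_{\varepsilon,\beta}} \frac{u^2}{\sigma(\beta)} \phi\left(\frac{u}{\sigma(\beta)}\right) du \right]
 - 2\varepsilon x_0 (y_0 - x_0\beta) I_{[|y_0 - x_0\beta| \le q_{\varepsilon,\beta}]} + \alpha\lambda\, \mathrm{sign}(\beta)
\end{aligned}
\]

and the Leibniz integral rule

\[
\frac{\partial}{\partial\beta}\left[ \int_{-q_{\varepsilon,\beta}}^{q_{\varepsilon,\beta}} \frac{u^2}{\sigma(\beta)} \phi\left(\frac{u}{\sigma(\beta)}\right) du \right]
= \int_{-q_{\varepsilon,\beta}}^{q_{\varepsilon,\beta}} u^2 \phi\left(\frac{u}{\sigma(\beta)}\right) \left(1 - \frac{u^2}{\sigma^2(\beta)}\right) \frac{(\beta_0 - \beta)\Sigma}{\sigma^3(\beta)} \, du
 + 2\, \frac{q_{\varepsilon,\beta}^2}{\sigma(\beta)} \phi\left(\frac{q_{\varepsilon,\beta}}{\sigma(\beta)}\right) \frac{\partial}{\partial\beta}\left[q_{\varepsilon,\beta}\right].
\]

To obtain the derivative $\Psi_1(0, \beta_{spLTS}(H_0))$, we can again use the Leibniz integral rule.
Differentiating the three parts of $\Psi(\varepsilon, \beta)$ with respect to $\varepsilon$ and evaluating at $\varepsilon = 0$ gives

\[
\begin{aligned}
\Psi_1(0, \beta_{spLTS}(H_0)) = \Bigg[
& -\int_{-q_{0,\beta}}^{q_{0,\beta}} u^2 \phi\left(\frac{u}{\sigma(\beta)}\right) \left(1 - \frac{u^2}{\sigma^2(\beta)}\right) \frac{(\beta_0 - \beta)\Sigma}{\sigma^3(\beta)} \, du
 - 2\, \frac{q_{0,\beta}^2}{\sigma(\beta)} \phi\left(\frac{q_{0,\beta}}{\sigma(\beta)}\right) \frac{\partial}{\partial\beta}\left[q_{0,\beta}\right] \\
& + \frac{\partial}{\partial\varepsilon}\left[ \int_{-q_{\varepsilon,\beta}}^{q_{\varepsilon,\beta}} u^2 \phi\left(\frac{u}{\sigma(\beta)}\right) \left(1 - \frac{u^2}{\sigma^2(\beta)}\right) du \right]_{\varepsilon = 0} \frac{(\beta_0 - \beta)\Sigma}{\sigma^3(\beta)} \\
& + 2\, \frac{\partial}{\partial\varepsilon}\left[ \frac{q_{\varepsilon,\beta}^2}{\sigma(\beta)} \phi\left(\frac{q_{\varepsilon,\beta}}{\sigma(\beta)}\right) \right]_{\varepsilon = 0} \frac{\partial}{\partial\beta}\left[q_{0,\beta}\right]
 + 2\, \frac{q_{0,\beta}^2}{\sigma(\beta)} \phi\left(\frac{q_{0,\beta}}{\sigma(\beta)}\right) \frac{\partial}{\partial\varepsilon}\left[\frac{\partial}{\partial\beta}\left[q_{\varepsilon,\beta}\right]\right]_{\varepsilon = 0} \\
& - 2 x_0 (y_0 - x_0\beta) I_{[|y_0 - x_0\beta| \le q_{0,\beta}]} \Bigg]_{\beta = \beta_{spLTS}(H_0)}.
\end{aligned}
\]

To compute the derivatives of the quantiles, we use that $q_{\varepsilon,\beta}$ is the $\alpha$-quantile of the
distribution of $|y - x'\beta|$ when $(x, y) \sim H_\varepsilon$, i.e. $P_{H_\varepsilon}(|y - x'\beta| \le q_{\varepsilon,\beta}) = \alpha$ and
$P_{H_0}(|y - x'\beta| \le q_{0,\beta}) = \alpha$. Differentiating w.r.t. the required variables yields

\[
\begin{aligned}
\frac{\partial}{\partial\varepsilon}\left[q_{\varepsilon, \beta_{spLTS}(H_0)}\right]\Big|_{\varepsilon = 0}
&= \frac{\left(\alpha - I_{[|y_0 - x_0\beta_{spLTS}(H_0)| \le q_{0, \beta_{spLTS}(H_0)}]}\right) \sigma(\beta_{spLTS}(H_0))}{2\phi(q_\alpha)}, \\
\frac{\partial}{\partial\beta}\left[q_{0,\beta}\right]\Big|_{\beta = \beta_{spLTS}(H_0)}
&= -\frac{q_{0, \beta_{spLTS}(H_0)} (\beta_0 - \beta_{spLTS}(H_0))\Sigma}{\sigma^2(\beta_{spLTS}(H_0))}, \\
\frac{\partial}{\partial\varepsilon}\left[\frac{\partial}{\partial\beta}\left[q_{\varepsilon,\beta}\right]\Big|_{\beta = \beta_{spLTS}(H_0)}\right]\Big|_{\varepsilon = 0}
&= \frac{I_{[|r_0| \le q_\alpha]} - \alpha}{2\phi(q_\alpha)} \cdot \frac{(\beta_0 - \beta_{spLTS}(H_0))\Sigma}{\sigma(\beta_{spLTS}(H_0))},
\end{aligned}
\]

with $r_0 := \dfrac{y_0 - x_0\beta_{spLTS}(H_0)}{\sigma(\beta_{spLTS}(H_0))}$. Thus,

\begin{align}
\Psi_1(0, \beta_{spLTS}(H_0)) = {} & \left(-4 q_\alpha \phi(q_\alpha) + 2\alpha + 2 q_\alpha^2 \left(I_{[|r_0| \le q_\alpha]} - \alpha\right)\right) (\beta_0 - \beta_{spLTS}(H_0))\Sigma \tag{38} \\
& - 2 x_0 (y_0 - x_0\beta_{spLTS}(H_0)) I_{[|r_0| \le q_\alpha]}. \tag{39}
\end{align}

With similar ideas as in the derivation of $\Psi_1(0, \beta_{spLTS}(H_0))$, we get

\[ \Psi_2(0, \beta_{spLTS}(H_0)) = \left(-4 q_\alpha \phi(q_\alpha) + 4\Phi(q_\alpha) - 2\right) \Sigma. \tag{40} \]

Using (39) and (40) in (37), we get the influence function (24) of the sparse LTS
functional for simple regression. □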
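The ratio in (37) can be evaluated directly from (38) to (40). The sketch below is a minimal illustration for simple regression, not a reference implementation: the value of the functional $\beta_{spLTS}(H_0)$ has to be supplied by the caller (it is given by Equation (15), which is not repeated here), and the parameter values and function names are assumptions.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def sparse_lts_if(x0, y0, beta0, beta_s, alpha, big_sigma=1.0, sigma=1.0):
    """IF = -Psi_1 / Psi_2 from (37)-(40); beta_s stands for beta_spLTS(H_0) != 0."""
    q = STD_NORMAL.inv_cdf((alpha + 1.0) / 2.0)          # q_alpha
    phi_q = STD_NORMAL.pdf(q)
    sigma_s = (sigma**2 + (beta0 - beta_s) ** 2 * big_sigma) ** 0.5
    r0 = (y0 - x0 * beta_s) / sigma_s                    # standardized residual
    inside = 1.0 if abs(r0) <= q else 0.0                # indicator I[|r0| <= q_alpha]
    psi1 = ((-4.0 * q * phi_q + 2.0 * alpha + 2.0 * q * q * (inside - alpha))
            * (beta0 - beta_s) * big_sigma
            - 2.0 * x0 * (y0 - x0 * beta_s) * inside)
    psi2 = (-4.0 * q * phi_q + 4.0 * STD_NORMAL.cdf(q) - 2.0) * big_sigma
    return -psi1 / psi2

# Hypothetical values: beta0 = 1.5, functional value 1.4, trimming proportion 0.75.
for point in [(1.0, 2.0), (10.0, -10.0), (-8.0, 9.0)]:
    print(point, sparse_lts_if(*point, beta0=1.5, beta_s=1.4, alpha=0.75))
```

For points whose standardized residual exceeds $q_\alpha$ the indicator vanishes and the influence function takes one constant value, which reflects the boundedness of the sparse LTS influence function.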
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